Hacker News new | ask | show | jobs
by derefr 4591 days ago
> Substitute each problematic byte with equivalent percent-encoded form. This does not touch bytes over 0x80 - they are assumed non-problematic.

You know what's crazy? Currently, in Unix, control characters are allowed in filenames. Like, \t and \n and \b and even \[. Those shouldn't be allowed, percent-escaped or not. Everything else you said is sensible.

2 comments

Technically NTFS allows those too. The filesystem, being a very low-level tool, hardly thinks of the upper layers and what pain it might inflict there. Its purpose is to store blobs under a name and retrieve them upon request. Since a char[] (or wchar_t[]) looks enough like a name that's what it uses.

That being said, enforcing such restrictions in upper layers brings pain as well, because suddenly you can have files that you cannot delete anymore (happens sometimes on Windows).

True; there's no reason that the filesystem should be storing anything other than char[]. The filesystem is a serialized domain, and char[] buffers are for storage and retrieval of serialized data. But that also means that each filesystem should explicitly specify a serialization format for what's stored in that char[] -- hopefully UTF-8.

However, the filesystem should really be where that serialized representation begins and ends. The filesystem should be interacting with the VFS layer using runes (Unicode codepoints), not octets.

And then, given that all filesystems route through the VFS, it can (and should) be enforcing preconditions on those runes in its API, expecting users to pass it something like a printable_rune_t[]. (Or even, horror of Pascalian horrors, a struct containing a length-prefixed printable_rune_t[].)

And for the situation where there's now files floating around without a printable_rune_t[] name -- this is why NTFS has been conceptually based around GUIDs (really, NT object IDs) for a decade now, with all names for a file just being indexed aliases. I wonder when Linux will get on that train...

Well, history sadly dictates that the interface to the upper layers it based around code units because those have always been fixed-length. Unicode came to late to most operating systems to really be ingrained in their design and where it was (Windows springs to mind) it all got a turn for the worse with the 16-to-21-bit shift in Unicode 2 with Unicode-by-default systems being no better than 8-bit-by-default systems had been a decade earlier.

That NTFS uses GUIDs internally to reference streams is news to me, though. But I think on Unix-like systems the equivalent would be inodes, I guess, right?

Percent-escaped control characters are fine. They'll just show up in a gui file manager as a <?> symbol, and on the shell as %02 for instance. The shell will never parse percent-encoded filenames and gui file managers don't interpret control characters.

Non-percent-encoded control chars are strictly verboten. The VFS layer should contain a ban list of bytes (or codepoints) not allowed as part of a filename. It won't be a large list, just every nonprintable character from ASCII, every blank character (space, tab, newline, vertical tab, carriage return etc.), the characters /\;:?* - and that's it. This list should cover everything that might be problematic in windows OR linux OR MacOS. For full compatibility, we must also add the %uxxxx and %Uxxxxxxxx percent escapes for arbitrary unicode codepoints (I can sense that it might make sense to also escape all the unicode spaces, combining characters and the like, to make file manipulation from the shell easier).

It sounds sort of sensible, but we're dealing with two layers of encoding here, leading to three byte sequences.

1. You have a string the user entered. That's just a generic name which can be anything.

2. You take that string and substitute "problematic" characters with their percent encoded form. For example, every space becomes %20, non-breakable space might become %ua0, or it might be left alone

3. You now have a string of unicode codepoints, which are all "clean". This is encoded yet again to a sequence of bytes that are stored by the filesystem.

At least the second coding is done by the system, either by the standard file manipulation routines, or by the filesystem itself.

But it is the first one that seems infeasible. It has to be done at a layer above the standard "open" function and I can see developers being very confused on what and how to escape.

You know, maybe the answer might be not to have every other program do all this complicated dancing, but for the shell itself to escape filenames when it reads them. So when you say "cmd file%20with%20space", cmd is called with argument one set to "file with space". And when ls or find lists files, bad characters can be replaced with their percent-encoded forms. And xargs can unescape them.

I'll need to think about it some more again.