Hacker News
by anon4 4584 days ago
I know someone will sooner or later propose that we ban spaces and special characters in names. Let me just put my two cents forward.

We should absolutely ban special characters from names. Specifically, all whitespace, the colon, semicolon, forward slash, backslash, question mark, asterisk, ampersand, and whatever else I'm missing that will confuse the shell. Also, file names must not start with a dash.

However, people should be able to name files with these characters. So I propose that these characters in filenames be percent-encoded like they would be in a URL. Specifically, the algorithm should be:

1. Take the file name and encode it as UTF-8. Enforce some sort of normalization.

2. Substitute each problematic byte with equivalent percent-encoded form. This does not touch bytes over 0x80 - they are assumed non-problematic.

3. Write the file in the file system under that name.

4. When displaying files, run the algorithm in reverse.
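A minimal Python sketch of the four steps above, under stated assumptions: NFC is picked as the normalization, `PROBLEMATIC` is a hypothetical ban list, and the percent sign itself is also escaped so that decoding stays unambiguous. The post specifies none of these choices; `encode_name` and `decode_name` are illustrative names.

```python
import re
import unicodedata

# Hypothetical ban list (an assumption): ASCII control bytes and whitespace
# (0x00-0x20), DEL, and the shell-hostile punctuation named in the post.
PROBLEMATIC = set(range(0x21)) | {0x7F} | set(b":;/\\?*&")
PERCENT = ord("%")
DASH = ord("-")

def encode_name(name: str) -> str:
    """Steps 1-3: normalize, encode as UTF-8, percent-encode problematic bytes."""
    data = unicodedata.normalize("NFC", name).encode("utf-8")
    out = bytearray()
    for i, b in enumerate(data):
        # '%' is escaped too (assumption) so a literal percent sign round-trips.
        # A leading dash is escaped because names must not start with one.
        if b in PROBLEMATIC or b == PERCENT or (i == 0 and b == DASH):
            out += f"%{b:02X}".encode("ascii")
        else:
            out.append(b)  # bytes >= 0x80 pass through untouched
    return out.decode("utf-8")

def decode_name(stored: str) -> str:
    """Step 4: reverse the percent-encoding for display."""
    raw = re.sub(
        rb"%([0-9A-Fa-f]{2})",
        lambda m: bytes([int(m.group(1), 16)]),
        stored.encode("utf-8"),
    )
    return raw.decode("utf-8")

print(encode_name("01 - Don't Eat the Yellow Snow.mp3"))
```

Escaping `%` as `%25` is what keeps a name such as `100%` from being misread on the way back.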

In the general case files like "01 - Don't Eat the Yellow Snow.mp3" would simply become 01%20-%20Don't%20Eat%20the%20Yellow%20Snow.mp3 in the filesystem and cause absolutely no further problems. To make it completely backwards-compatible we should also add the following rule: If a filename includes a problematic byte or a percent-encoded byte higher than 0x80, then it is assumed to be raw and will not undergo percent decoding.
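The backwards-compatibility rule could be checked roughly like this. It is only a sketch: `is_raw` and `display_name` are hypothetical helpers, the `PROBLEMATIC` set is an assumed ban list, and "higher than 0x80" is read here as 0x80 and above.

```python
import re

# Assumed ban list: ASCII control bytes and whitespace, DEL, shell punctuation.
PROBLEMATIC = set(range(0x21)) | {0x7F} | set(b":;/\\?*&")
ESCAPE = re.compile(rb"%([0-9A-Fa-f]{2})")

def is_raw(stored: bytes) -> bool:
    """True if the on-disk name must be treated as a legacy, unencoded name."""
    # Rule part 1: a problematic byte appears literally.
    if any(b in PROBLEMATIC for b in stored):
        return True
    # Rule part 2: an escape encodes a byte the encoder would never touch.
    return any(int(h, 16) >= 0x80 for h in ESCAPE.findall(stored))

def display_name(stored: bytes) -> str:
    """Decode for display unless the name is raw."""
    if is_raw(stored):
        return stored.decode("utf-8", errors="replace")
    decoded = ESCAPE.sub(lambda m: bytes([int(m.group(1), 16)]), stored)
    return decoded.decode("utf-8", errors="replace")

print(display_name(b"a%20b"))       # decoded
print(display_name(b"a b%20c"))     # raw: contains a literal space
print(display_name(b"caf%C3%A9"))   # raw: escape for a byte >= 0x80
```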

Basically, I propose that every program which receives free text input for a file name percent-encode the filenames before writing them to the filesystem and decode them for display. Everything else remains unchanged.

Why this will not work:

Requiring programmers to keep track of two filenames instead of just one is rather a lot of work. File APIs will have to take both encoded and non-encoded forms and encode the non-encoded form, creating problems when people inadvertently use the wrong function with a name, either double-encoding it or not encoding it and leading to "this file does not exist" errors.
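To make the double-encoding failure concrete, here is a small sketch that uses Python's `urllib.parse.quote` and `unquote` as a stand-in for the proposed encoder. The stand-in is an assumption; the real scheme would use its own ban list.

```python
from urllib.parse import quote, unquote

user_name = "my file.txt"

once = quote(user_name, safe="")   # the correct on-disk name
twice = quote(once, safe="")       # what happens if an already-encoded name is encoded again

print(once)            # my%20file.txt
print(twice)           # my%2520file.txt
print(unquote(twice))  # my%20file.txt, not the name the user typed
```

A program that looks for `twice` on disk when only `once` exists gets exactly the "this file does not exist" error described above.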

It will be possible to create two files with different names on disk which are nonetheless shown with the same name to the user.
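A short sketch of how that collision arises once legacy names are shown untouched. `shown_as` is a hypothetical display function, `SHELL_HOSTILE` is an assumed ban list, and `urllib.parse.unquote` stands in for the decoder.

```python
from urllib.parse import unquote

SHELL_HOSTILE = set(" \t\n:;/\\?*&")  # assumed ban list

def shown_as(stored: str) -> str:
    # Legacy rule: a name containing a raw problematic character is displayed as-is.
    if any(ch in SHELL_HOSTILE for ch in stored):
        return stored
    return unquote(stored)

on_disk = ["my file.txt", "my%20file.txt"]   # two distinct directory entries
shown = [shown_as(n) for n in on_disk]
print(shown)  # both entries are displayed identically
```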

Why it is ugly:

We're taping over a deficiency of an ancient language by inflicting pain on programmers.

Double-encoded filenames? MADNESS.

Why I like it:

I'll be able to have ?, * and : in filenames in Windows.

My shell scripts will be much simpler.

What do you guys think?

1 comments

> Substitute each problematic byte with equivalent percent-encoded form. This does not touch bytes over 0x80 - they are assumed non-problematic.

You know what's crazy? Currently, in Unix, control characters are allowed in filenames. Like, \t and \n and \b and even \[. Those shouldn't be allowed, percent-escaped or not. Everything else you said is sensible.
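For anyone who wants to see this first-hand, a quick sketch, assuming a POSIX system (on Windows the `open` call is expected to fail). It creates a file whose name contains a tab and a newline inside a temporary directory.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as d:
    odd_name = "tab\there\nnewline"
    # The kernel only rejects '/' and the NUL byte in a path component.
    with open(os.path.join(d, odd_name), "w") as f:
        f.write("still a valid file\n")
    listing = os.listdir(d)

print(repr(listing))
```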

Technically NTFS allows those too. The filesystem, being a very low-level tool, hardly thinks of the upper layers and what pain it might inflict there. Its purpose is to store blobs under a name and retrieve them upon request. Since a char[] (or wchar_t[]) looks enough like a name that's what it uses.

That being said, enforcing such restrictions in upper layers brings pain as well, because suddenly you can have files that you cannot delete anymore (happens sometimes on Windows).

True; there's no reason that the filesystem should be storing anything other than char[]. The filesystem is a serialized domain, and char[] buffers are for storage and retrieval of serialized data. But that also means that each filesystem should explicitly specify a serialization format for what's stored in that char[] -- hopefully UTF-8.

However, the filesystem should really be where that serialized representation begins and ends. The filesystem should be interacting with the VFS layer using runes (Unicode codepoints), not octets.

And then, given that all filesystems route through the VFS, it can (and should) be enforcing preconditions on those runes in its API, expecting users to pass it something like a printable_rune_t[]. (Or even, horror of Pascalian horrors, a struct containing a length-prefixed printable_rune_t[].)

And for the situation where there are now files floating around without a printable_rune_t[] name -- this is why NTFS has been conceptually based around GUIDs (really, NT object IDs) for a decade now, with all names for a file just being indexed aliases. I wonder when Linux will get on that train...

Well, history sadly dictates that the interface to the upper layers is based around code units, because those have always been fixed-length. Unicode came too late to most operating systems to really be ingrained in their design, and where it was (Windows springs to mind) it all took a turn for the worse with the 16-to-21-bit shift in Unicode 2, leaving Unicode-by-default systems no better off than 8-bit-by-default systems had been a decade earlier.

That NTFS uses GUIDs internally to reference streams is news to me, though. But I think on Unix-like systems the equivalent would be inodes, I guess, right?
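On the inode comparison, a small sketch of names acting as aliases for one inode. It assumes a POSIX filesystem that supports hard links; the file names are made up for illustration.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as d:
    first = os.path.join(d, "first-name")
    second = os.path.join(d, "second-name")
    with open(first, "w") as f:
        f.write("data")

    os.link(first, second)  # add a second name for the same inode
    same_inode = os.stat(first).st_ino == os.stat(second).st_ino
    link_count = os.stat(first).st_nlink

    os.remove(first)        # dropping one alias leaves the file reachable
    with open(second) as f:
        survived = f.read()

print(same_inode, link_count, survived)
```

Unlike an object ID, though, an inode number is not something user programs can normally open a file by; the names remain the only handle.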

Percent-escaped control characters are fine. They'll just show up in a GUI file manager as a <?> symbol, and on the shell as %02 for instance. The shell will never parse percent-encoded filenames and GUI file managers don't interpret control characters.

Non-percent-encoded control chars are strictly verboten. The VFS layer should contain a ban list of bytes (or codepoints) not allowed as part of a filename. It won't be a large list, just every nonprintable character from ASCII, every blank character (space, tab, newline, vertical tab, carriage return etc.), the characters /\;:?* - and that's it. This list should cover everything that might be problematic in Windows OR Linux OR macOS. For full compatibility, we must also add the %uxxxx and %Uxxxxxxxx percent escapes for arbitrary Unicode codepoints (I can sense that it might make sense to also escape all the Unicode spaces, combining characters and the like, to make file manipulation from the shell easier).
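A ban-list check of this kind might look like the sketch below. `first_banned` is a hypothetical helper, and using `str.isspace` to also catch Unicode blanks is an assumption that goes slightly beyond the ASCII list.

```python
BANNED_PUNCTUATION = set("/\\;:?*")

def first_banned(name: str):
    """Return (index, char) of the first banned codepoint, or None if the name is clean."""
    for i, ch in enumerate(name):
        cp = ord(ch)
        if cp < 0x20 or cp == 0x7F:      # nonprintable ASCII
            return i, ch
        if ch.isspace():                 # space, tab, newline, ... and Unicode blanks
            return i, ch
        if ch in BANNED_PUNCTUATION:
            return i, ch
    return None

print(first_banned("file%20name.txt"))  # None: already escaped
print(first_banned("file name.txt"))    # (4, ' ')
```

A VFS-level `open` or `rename` would refuse any name for which this returns something other than `None`.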

It sounds sort of sensible, but we're dealing with two layers of encoding here, leading to three byte sequences.

1. You have a string the user entered. That's just a generic name which can be anything.

2. You take that string and substitute "problematic" characters with their percent-encoded form. For example, every space becomes %20, non-breakable space might become %ua0, or it might be left alone.

3. You now have a string of unicode codepoints, which are all "clean". This is encoded yet again to a sequence of bytes that are stored by the filesystem.

At least the second coding is done by the system, either by the standard file manipulation routines, or by the filesystem itself.
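The three sequences can be laid side by side in a short sketch. `escape_codepoints` is a hypothetical function; choosing `%XX` for ASCII and a four-digit `%uXXXX` for Unicode blanks is an assumption about details the thread leaves open.

```python
def escape_codepoints(name: str) -> str:
    """First encoding: replace problematic codepoints with percent escapes."""
    out = []
    for ch in name:
        cp = ord(ch)
        if cp < 0x80 and (cp <= 0x20 or cp == 0x7F or ch in "/\\;:?*%"):
            out.append(f"%{cp:02X}")     # ASCII blanks, controls, punctuation
        elif ch.isspace():
            out.append(f"%u{cp:04x}")    # Unicode blanks such as U+00A0
        else:
            out.append(ch)               # everything else is left alone
    return "".join(out)

user_string = "caf\u00e9\u00a0menu v2.txt"    # 1. what the user typed
escaped = escape_codepoints(user_string)      # 2. "clean" codepoints
on_disk = escaped.encode("utf-8")             # 3. bytes stored by the filesystem

print(escaped)
print(on_disk)
```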

But it is the first one that seems infeasible. It has to be done at a layer above the standard "open" function and I can see developers being very confused about what to escape and how.

You know, maybe the answer might be not to have every other program do all this complicated dancing, but for the shell itself to escape filenames when it reads them. So when you say "cmd file%20with%20space", cmd is called with argument one set to "file with space". And when ls or find lists files, bad characters can be replaced with their percent-encoded forms. And xargs can unescape them.
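Roughly what that shell-side behaviour could look like, as a sketch only. `shell_unescape_args` and `ls_escape` are hypothetical names, and `urllib.parse` stands in for the escaping rules; unlike the proposal it would also escape non-ASCII bytes, so the example sticks to ASCII.

```python
from urllib.parse import quote, unquote

def shell_unescape_args(argv):
    """What the shell would do before exec: hand programs the real names."""
    return [unquote(arg) for arg in argv]

def ls_escape(names):
    """What ls or find would do on output: print only shell-safe names."""
    return [quote(name, safe="") for name in names]

typed = ["cmd", "file%20with%20space"]
print(shell_unescape_args(typed))      # ['cmd', 'file with space']
print(ls_escape(["file with space"]))  # ['file%20with%20space']
```

With this split, programs keep calling plain `open` on real names and only the shell and the listing tools need to know about escapes.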

I'll need to think about it some more.