| I know someone will sooner or later propose that we ban spaces and special characters in names, so let me put my two cents forward.

We should absolutely ban special characters from names on disk. Specifically: all whitespace, the colon, semicolon, forward slash, backslash, question mark, star, ampersand, and whatever else I'm missing that will confuse the shell. Also, file names cannot start with a dash. However, people should still be able to name files with these characters. So I propose that these characters be percent-encoded in filenames, like they would be in a URL. Specifically, the algorithm should be:

1. Take the file name and encode it as UTF-8. Enforce some sort of normalization.
2. Substitute each problematic byte with its percent-encoded form. This does not touch bytes 0x80 and above; they are assumed non-problematic.
3. Write the file to the file system under that name.
4. When displaying files, run the algorithm in reverse.

In the general case, a file like "01 - Don't Eat the Yellow Snow.mp3" would simply become 01%20-%20Don't%20Eat%20the%20Yellow%20Snow.mp3 in the filesystem and cause absolutely no further problems.

To make it completely backwards-compatible we should also add the following rule: if a filename includes a problematic byte, or a percent-encoded byte of 0x80 or higher, then it is assumed to be raw and will not undergo percent-decoding.

Basically, I propose that every program which receives free-text input for a file name percent-encode the name before writing it to the filesystem and decode it for display. Everything else remains unchanged.

Why this will not work: Requiring programmers to keep track of two filenames instead of just one is rather a lot of work. File APIs will have to accept both encoded and non-encoded forms and encode the non-encoded form, which creates problems when people inadvertently pass a name to the wrong function, either double-encoding it or not encoding it at all, and that leads to "this file does not exist" errors.
It will also be possible to create two files with different names on disk that are nonetheless shown to the user under the same name.

Why it is ugly: We're taping over a deficiency of an ancient language by inflicting pain on programmers. Double-encoded filenames? MADNESS.

Why I like it: I'll be able to have ?, * and : in filenames on Windows. My shell scripts will be much simpler.

What do you guys think? |
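
For anyone who wants to poke at the proposal, here is a rough Python sketch of the four steps plus the backwards-compatibility rule. It is only one possible reading of the post, not a reference implementation. The names `encode_name`, `decode_name` and `PROBLEMATIC` are made up for illustration, NFC is picked arbitrarily as the "some sort of normalization", only ASCII whitespace is treated as whitespace, and the percent sign itself is also escaped (an assumption the post does not state, but without it a name containing a literal `%20` would not survive a round trip).

```python
import re
import unicodedata

# Assumed set of problematic ASCII bytes, taken from the list in the post:
# ASCII whitespace, colon, semicolon, slash, backslash, ?, *, &.
PROBLEMATIC = set(b" \t\n\r\x0b\x0c:;/\\?*&")
PERCENT = ord("%")  # assumption: the escape character must be escaped too
DASH = ord("-")

_ESCAPE = re.compile(rb"%([0-9A-Fa-f]{2})")


def encode_name(display_name: str) -> str:
    """Steps 1-2: normalize, encode as UTF-8, escape problematic bytes."""
    raw = unicodedata.normalize("NFC", display_name).encode("utf-8")
    out = bytearray()
    for i, b in enumerate(raw):
        leading_dash = (i == 0 and b == DASH)
        if b in PROBLEMATIC or b == PERCENT or leading_dash:
            out += f"%{b:02X}".encode("ascii")
        else:
            out.append(b)  # includes every byte >= 0x80, left untouched
    return out.decode("utf-8")


def decode_name(stored_name: str) -> str:
    """Step 4, with the raw-name rule for files that predate the scheme."""
    raw = stored_name.encode("utf-8")
    # A problematic byte (or a leading dash) means the name was never encoded.
    if raw[:1] == b"-" or any(b in PROBLEMATIC for b in raw):
        return stored_name
    # An escape for a byte >= 0x80 also marks the name as raw.
    if any(int(m.group(1), 16) >= 0x80 for m in _ESCAPE.finditer(raw)):
        return stored_name
    decoded = _ESCAPE.sub(lambda m: bytes([int(m.group(1), 16)]), raw)
    return decoded.decode("utf-8")


print(encode_name("01 - Don't Eat the Yellow Snow.mp3"))
print(decode_name(encode_name("-rf *")))
```

Note that the sketch also shows the collision problem from the "why this will not work" part: a legacy file literally named `a%20b` and a new file named `a b` both display as `a b`.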
You know what's crazy? Currently, in Unix, control characters are allowed in filenames. Like, \t and \n and \b and even \[. Those shouldn't be allowed, percent-escaped or not. Everything else you said is sensible.
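
If control characters were rejected outright, the check could be as small as the following Python sketch. The function name `has_control_chars` is hypothetical, and the sketch assumes "control character" means the ASCII C0 range (0x00 to 0x1F) plus DEL (0x7F); whether the C1 range should count too is left open.

```python
def has_control_chars(name: str) -> bool:
    """Return True if the name contains an ASCII control character.

    Assumption: only C0 controls (0x00-0x1F) and DEL (0x7F) are checked.
    """
    return any(ord(ch) < 0x20 or ord(ch) == 0x7F for ch in name)


for candidate in ["notes.txt", "tab\there", "line\nbreak", "bell\x07"]:
    verdict = "reject" if has_control_chars(candidate) else "ok"
    print(repr(candidate), verdict)
```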