HN Mirror

	by amptorn 1637 days ago
	Why in the world does Unix allow newlines in a filename in the first place? That's just such an obviously brain-damaged idea. There's not a single rational use case for it, yet it breaks nearly every text-based tool you could possibly imagine...

10 comments

marcosdumay 1637 days ago

Why would Unix go and add random restrictions to filenames?

And what text protocol requires you to just insert user data without escaping or re-encoding? That looks badly broken. The kind of broken that will give your entire system to a hacker for encrypting and demanding ransom.
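
To make the escaping point concrete, here is a minimal sketch of a reversible escape for a line-oriented format. The function names `escape_field` and `unescape_field` are made up for illustration, and the backslash scheme is just one possible convention, not something the article or any particular tool prescribes.

```python
def escape_field(raw: bytes) -> bytes:
    """Make a byte string safe to write as one line of a line-oriented format."""
    # Escape the backslash first so the transformation stays reversible.
    return raw.replace(b"\\", b"\\\\").replace(b"\n", b"\\n")


def unescape_field(escaped: bytes) -> bytes:
    """Inverse of escape_field."""
    out = bytearray()
    i = 0
    while i < len(escaped):
        if escaped[i:i + 1] == b"\\" and i + 1 < len(escaped):
            nxt = escaped[i + 1:i + 2]
            out += b"\n" if nxt == b"n" else nxt
            i += 2
        else:
            out += escaped[i:i + 1]
            i += 1
    return bytes(out)


tricky = b"two\nlines\\n.txt"
line = escape_field(tricky)
print(line)                            # a single line, no raw newline in it
print(unescape_field(line) == tricky)  # the original name comes back
```

With something like this in place, a newline in a filename is just another byte to encode, and one record still occupies exactly one line.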

link

jagrsw 1637 days ago

> yet it breaks nearly every text-based tool you could possibly imagine

It breaks badly designed text protocols - some can argue that it's a good idea - "crash early, crash loud" etc.

Also if your protocol breaks with newlines, it probably breaks with other non-literals - brackets, quotes, NUL-bytes, control characters, carriage return char, multibyte chars etc etc.

link

wutbrodo 1637 days ago

> It breaks badly designed text protocols - some can argue that it's a good idea - "crash early, crash loud" etc

This is decidedly not a case of "fail loudly", which I agree is generally a good idea. The very first example in the article is one of silently incorrect or ambiguous output, not loud failure.

link

bayindirh 1637 days ago

I'm against limiting the character set allowed for file names. macOS is also in the same boat with Linux, going one step forward and allowing \null terminator even in the filenames.

If we're going to limit filenames' character sets, I can offer a simpler solution:

Why allow file names? OS should provide a UUID for all files. No names, nothing. We can just write which file is what to another file, noting its UUIDs to sticky notes.

link

dragonwriter 1637 days ago

> Why allow file names? OS should provide a UUID for all files. No names, nothing. We can just write which file is what to another file, noting its UUIDs to sticky notes.

But... isn't that what filesystems, in effect, already do? Files have IDs, which are mapped to names in a separate record. Having it in one common shared place for the whole filesystem, and a common OS API that provides access to it for all mounted filesystems, just makes things like useful, user-friendly shells (graphical and text) and common controls possible without everything user-facing needing a separate UI constructed from scratch for each app's files.
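
A small sketch of the "IDs mapped to names" idea on a POSIX-style filesystem: two directory entries created with a hard link report the same device and inode number. The helper name `names_share_id` and the file names are illustrative only.

```python
import os
import tempfile


def names_share_id(path_a: str, path_b: str) -> bool:
    """True when both names resolve to the same file ID (device + inode)."""
    a, b = os.stat(path_a), os.stat(path_b)
    return (a.st_dev, a.st_ino) == (b.st_dev, b.st_ino)


with tempfile.TemporaryDirectory() as d:
    first = os.path.join(d, "first-name")
    second = os.path.join(d, "second-name")
    with open(first, "w") as f:
        f.write("same file\n")
    os.link(first, second)  # add a second directory entry for the same inode
    shared = names_share_id(first, second)
    print(shared)
```

The name lives in the directory and the ID belongs to the file, which is why one file can carry several names.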

link

8organicbits 1637 days ago

Is there a userspace command like `ls` that lists files in a folder by those IDs?

link

mustache_kimono 1637 days ago

Um, 'ls -i'?

link

abofh 1637 days ago

'ls -i'
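
For anyone who wants the same information programmatically, a rough Python equivalent of `ls -i` might look like the sketch below. The function name `list_with_ids` is an assumption for illustration; `os.scandir` and `DirEntry.inode()` are standard library calls.

```python
import os


def list_with_ids(directory: str = "."):
    """Return sorted (inode, name) pairs, roughly what `ls -i` prints."""
    with os.scandir(directory) as entries:
        return sorted((entry.inode(), entry.name) for entry in entries)


for inode, name in list_with_ids("."):
    print(inode, name)
```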

link

feldrim 1637 days ago

This is an old solution to a problem that does not exist. Yes, in that case the file system can be a key-value store. It would eliminate the need for a tree structure. But the tree structure has a meaning: it adds context. The directories are containers of files that add a semantic abstraction to the files within.

https://devblogs.microsoft.com/oldnewthing/20110228-00/?p=11...

link

wlib 1637 days ago

Why do we impose hierarchy so much in file systems? We already allow hard and soft links, so it’s not even a tree anyways. Why not just allow any reference types you want; no name with extensions, but a set of tags. Why not identify files the same way a graph database query identifies nodes?

link

sitharus 1637 days ago

Because hierarchical structures and names are easy to explain to most people. macOS has supported tagging for ages, but I’ve never seen it used extensively or as a complete alternative to tree structure.

link

feldrim 1637 days ago

So you propose a graph database for data structures, without the persistence layer provided by the file system, right?

link

dahfizz 1637 days ago

Relative paths are extremely useful. Every user gets their own .bashrc and they don't have to fully qualify it to open the file

link

gglitch 1637 days ago

I’m with you on the directory tree, but like the idea of files having both names and unique, autogenerated IDs.

Edit: optionally having IDs.

link

feldrim 1637 days ago

Windows allows you to have optional IDs.

link

crispyambulance 1637 days ago

> Why allow file names? OS should provide a UUID for all files. No names, nothing.

On an application level that's sort of starting to happen. It's annoying though. Sometimes you just need to know where the actual F Apple put your photos (it's not obvious). If different applications need to work with the same files, then there's an annoying coordination problem if one application tries to pretend that "files" don't exist and another needs a file path.

Autodesk Fusion 360 tucks your projects into a cloud. I know there's some local cache, but there's no need to think about it because only Fusion-360 handles those "files" and I just worry about my project assets as presented to me by the UI. In that case, it's OK, but it also suggests a "walled-garden" of files for each application.

link

pklausler 1637 days ago

We could use SHA-256 for the UUIDs, map names to hashes in special directory files, and build a source code control system out of it too while we’re at it.
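
As a toy illustration of the joke, here is a sketch of a content-addressed store: blobs are keyed by their SHA-256 digest and names live in a separate map. The class `ContentStore` is hypothetical, and this is not how Git lays out its objects (Git hashes a header plus the content), only the general idea.

```python
import hashlib


class ContentStore:
    """Toy in-memory store: blobs keyed by SHA-256, names kept separately."""

    def __init__(self):
        self.blobs = {}  # hex digest -> bytes
        self.names = {}  # name -> hex digest (the "directory file")

    def put(self, name: str, data: bytes) -> str:
        digest = hashlib.sha256(data).hexdigest()
        self.blobs[digest] = data  # identical content is stored only once
        self.names[name] = digest
        return digest

    def get(self, name: str) -> bytes:
        return self.blobs[self.names[name]]


store = ContentStore()
id_a = store.put("notes.txt", b"hello\n")
id_b = store.put("copy of notes.txt", b"hello\n")
print(id_a == id_b, len(store.blobs))
```

Two names with identical content end up pointing at one blob, so deduplication falls out of the design for free.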

link

jdblair 1637 days ago

git outta here!

link

dahfizz 1637 days ago

> macOS is also in the same boat with Linux, going one step forward and allowing \null terminator even in the filenames.

Does that mean that there are files impossible to open with fopen on macOS? How does any of that work?

link

Latty 1637 days ago

Unix filenames are just sequences of bytes, not defined as strings. Most programs parse them as utf-8, but there is nothing mandating that. Obviously that leads to problems.
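
A short sketch of what "bytes, not strings" means in practice. The name `raw_name` is an invented example of a filename that is not valid UTF-8. Python's `surrogateescape` error handler (which, as far as I know, is also what `os.fsdecode` and `os.fsencode` use on POSIX systems) lets such a name survive a round trip through `str`.

```python
raw_name = b"report-\xff\xfe.txt"  # legal as a Unix filename, not valid UTF-8

try:
    raw_name.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# surrogateescape smuggles the undecodable bytes through as lone surrogates.
as_text = raw_name.decode("utf-8", errors="surrogateescape")
round_tripped = as_text.encode("utf-8", errors="surrogateescape")

print(strict_ok, round_tripped == raw_name)
```

A program that decodes filenames strictly fails on this name, while one that keeps the raw bytes, or uses a lossless handler, does not.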

link

ninkendo 1637 days ago

One pedantic qualification: any byte except 0x2f (`/`) or 0x00.
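
Written out as code, the rule is tiny. This sketch checks a single path component. The function name is made up, and it deliberately ignores other constraints such as the reserved names `.` and `..` or per-filesystem length limits.

```python
def is_valid_component(name: bytes) -> bool:
    """True if `name` obeys the 'any byte except 0x2f and 0x00' rule."""
    return len(name) > 0 and b"/" not in name and b"\x00" not in name


print(is_valid_component(b"two\nlines.txt"))  # newline is allowed
print(is_valid_component(b"a/b"))             # slash is the separator
print(is_valid_component(b"nul\x00byte"))     # NUL terminates C strings
```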

This actually rules out nearly any non-UTF8 character set (besides ASCII.)

Quote from Linus, which reminds me of Henry Ford’s “you can have any color you want, so long as it’s black”:

> And that one true format is UTF-8. End of story. If you try to talk to the kernel in UCS-2 or anything else, you _will_ fail.

https://lore.kernel.org/all/Pine.LNX.4.58.0402141827200.1402...

link

jcranmer 1637 days ago

> This actually rules out nearly any non-UTF8 character set (besides ASCII.)

It doesn't--pretty much any character set that has seen widespread use in the past few decades would be compatible. Any single-byte charsets that are ASCII compatible (such as most Windows CP* sets or the entire ISO-8859-* suite) would work. Most Asiatic charsets (e.g., EUC-JP, Shift-JIS, Big5, GBK) that use variable-width encodings follow the rule that bytes in the 0x00-0x7f range are ASCII when they start a character, and the trailing bytes of a multibyte character fall in the 0x40-0xff range. Neither 0x00 nor 0x2f can show up as a trailing byte, so those charsets are compatible as well.

So actually the list of notable incompatible charsets is easier to write out: UTF-16, UTF-32, EBCDIC, and ISO-2022-* charsets (which are mode-switching).
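
This is easy to check empirically with Python's built-in codecs. The sketch below encodes a sample name in several encodings and reports whether the forbidden bytes 0x00 or 0x2f appear. The helper `forbidden_bytes` and the sample name are illustrative assumptions, and a handful of samples is a spot check, not a proof.

```python
FORBIDDEN = {0x00, 0x2F}


def forbidden_bytes(text: str, encoding: str) -> set:
    """Forbidden filename bytes that appear when `text` is encoded."""
    return FORBIDDEN & set(text.encode(encoding))


name = "\u65e5\u672c.txt"  # two CJK characters followed by ".txt", no slash
for enc in ["utf-8", "shift_jis", "euc_jp", "big5", "gbk",
            "utf-16-le", "utf-32-le"]:
    print(enc, sorted(forbidden_bytes(name, enc)))

# A single non-slash character can contain a 0x2f byte in UTF-16.
print(sorted(forbidden_bytes("\u012f", "utf-16-le")))
```

The multibyte Asian encodings come out clean, while UTF-16 and UTF-32 embed NUL bytes as soon as the name contains any ASCII character.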

link

ninkendo 1637 days ago

Eh, fair enough. While you’re correct, character sets that are “ascii, but something custom when the high bit is 1” are all just “ascii” to me, in that they are all mutually incompatible for anything other than the first 128 characters, and 8-bit encoding in general has been ubiquitous for nearly as long as ascii has been defined. (Meaning that when most people say “ascii”, they’re actually referring to one of those encodings in practice.)

Asiatic character sets are an interesting point though. I wonder how common they were at the time of what Linus wrote…

link

jcranmer 1637 days ago

> While you’re correct, character sets that are “ascii, but something custom when the high bit is 1” are all just “ascii” to me

Don't call them just "ASCII"--that only serves to confuse people. Call them 8-bit ASCII-compatible charsets if you need a collective noun, but note that they are very different.

> (Meaning that when most people say “ascii”, they’re actually referring to one of those encodings in practice.)

Having actually worked on charset handling, when most people say "ASCII", they mean "ASCII" and not anything else. If a document is labeled as ASCII, then generally it should be handled as Windows-1252. If a conversion function claims to convert ASCII to something else, and doesn't provide any error mechanism (which it really should), then it usually means ISO-8859-1 aka Latin-1 aka map each byte to the first 256 Unicode characters.
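
The difference between these three interpretations shows up in a few lines of Python. This is only a sketch using the standard codecs; the variable and function names are invented for the example.

```python
all_bytes = bytes(range(256))

# Latin-1: byte value N decodes to code point N, so no byte can ever fail.
latin1_is_identity = (
    [ord(ch) for ch in all_bytes.decode("latin-1")] == list(range(256))
)

# Windows-1252 reassigns much of 0x80-0x9F to printable characters.
euro_cp1252 = b"\x80".decode("cp1252")


def decodes_as_ascii(data: bytes) -> bool:
    """Strict ASCII rejects any byte with the high bit set."""
    try:
        data.decode("ascii")
        return True
    except UnicodeDecodeError:
        return False


print(latin1_is_identity, euro_cp1252 == "\u20ac", decodes_as_ascii(b"\x80"))
```

This is also why a converter with no error mechanism tends to fall back on Latin-1: every possible input byte has a defined result.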

But I'd never see, e.g., a KOI8-R document referred to as ASCII, nor anything that claimed to be ASCII assumed to be a KOI8-R document.

> Asiatic character sets are an interesting point though. I wonder how common they were at the time of what Linus wrote…

https://4.bp.blogspot.com/-O4jXmTm7WWI/Tyw1As8jt7I/AAAAAAAAI...

At the time he wrote that, the main Asiatic charsets for Chinese and Japanese would have been more common than UTF-8. Maybe Korean as well, although Linus's message is around the time that UTF-8 overtook EUC-KR. In any case, anyone who knew anything about character sets at the time would have been well aware of Asiatic variable-width character sets.

link

ninkendo 1637 days ago

I appreciate your insight, but I just want to expand on one point:

> Having actually worked on charset handling, when most people say "ASCII", they mean "ASCII" and not anything else.

Approximately zero people are referring to a true, packed, 7-bit encoding when they say "ASCII". They're nearly always talking about an 8-bit character set, and in such cases, something must happen when the high bit is 1. (I've never seen one that plain ignores or uses error glyphs for characters >127, although you likely have more experience with this than I do.) This is why I said people are referring to one of these encodings in practice... because ascii is 7-bit, and approximately everyone is talking about some 8-bit encoding of one form or another.

I would definitely agree that most wouldn't call KOI8-R "ascii", but they may use the term "ascii" to describe the first 128 characters of KOI8-R. (Notwithstanding if it uses weird replacement characters like Shift_JIS does with the backslash and the yen sign.) This is the reason for my comment about how the weird "ascii + custom" all just feels like ascii to me... if you stay below 128 it literally is.

I'll modify my original statement thusly:

> This actually rules out nearly any character set that isn't compatible with ASCII.

And add an addendum that if you don't use UTF-8, you can't use Unicode and will be stuck in code page/locale hell.

link

CRConrad 1634 days ago

> Having actually worked on charset handling, when most people say "ASCII", they mean "ASCII" and not anything else.

Most American people, maybe.

link

dylan604 1637 days ago

I see your pedantic and raise you: UTF-8 isn't a font though. It's a text encoding.

link

marklgr 1637 days ago

String bets not allowed, whatever their encoding ;)

link

amptorn 1637 days ago

> Unix filenames are just sequences of bytes, not defined as strings

"Write programs to handle text streams, because that is a universal interface except for filenames which are opaque binary"

link

dzaima 1637 days ago

Why not also, while at it, disallow spaces too? They can very easily cause problems too, if you split by spaces instead of newlines. Quotes and backslashes obviously are also bad. How about all of non-ASCII Unicode? That'd break all code assuming character count equals byte count, and can probably cause buffer overflows when people count correctly.

Whatever characters you disallow, people can still fail on some other character. Sure, it'd decrease the likelihood of messing things up by some amount, but that's a half-assed solution at best, and would make people check for mistakes less at worst. Imagine if Intel fixed the Pentium FDIV bug by only fixing 30% of the wrong results.
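
The character-count versus byte-count trap mentioned above takes two lines to demonstrate. The sample name is made up.

```python
name = "na\u00efve caf\u00e9.txt"  # two accented letters, otherwise ASCII

char_count = len(name)
byte_count = len(name.encode("utf-8"))

print(char_count, byte_count)  # the two numbers differ
```

Code that sizes a buffer from the first number and then copies the second number of bytes is exactly the kind of bug being described.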

link

jl6 1637 days ago

I can’t think of why you’d ever want a newline in a filename, but it does make for easier reasoning about what characters (or perhaps I should say bytes) could be found in filenames, as opposed to having to remember a long list of exceptions.

link

jlarocco 1637 days ago

> That's just such an obviously brain-damaged idea.

Is it, though? "Every character except '/' because it's the directory delimiter" seems pretty straightforward to me...

> There's not a single rational use case for it, yet it breaks nearly every text-based tool you could possibly imagine...

You don't have a use case, but that doesn't mean nobody else has one.

And as far as "text-based tools" go, their developers should RTFM. I'm fairly sure UNIX existed before almost all of them, and it's accepted newlines all along.

link

tyingq 1637 days ago

It is odd. Though tools like find have "-print0" for this purpose, and there are corresponding input flags for xargs, perl, sort, uniq, cut, head, etc. that accept NUL-terminated rather than newline-terminated lists.
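
On the consuming side, handling NUL-terminated lists is straightforward. A minimal sketch, assuming the input looks like what `find -print0` emits; the `split_nul` helper and the sample listing are invented for illustration.

```python
def split_nul(data: bytes) -> list:
    """Split NUL-terminated records, dropping the empty tail after the last NUL."""
    return [record for record in data.split(b"\x00") if record]


# Two files, one of which has a newline in its name.
nul_listing = b"plain.txt\x00two\nlines.txt\x00"
newline_listing = b"plain.txt\ntwo\nlines.txt\n"

print(split_nul(nul_listing))        # two records, as intended
print(newline_listing.splitlines())  # three records: the name was split
```

Because NUL is one of the two bytes a filename can never contain, it is the only delimiter that is unambiguous without escaping.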

link

kroltan 1636 days ago

No, write your software properly. Assuming anything at all about file names is how we get to silly things like Windows' "CON" or whatever restrictions.

link

mistrial9 1637 days ago

my imagined reason is -- because when that terrible day happens, and an important file with some new name does in fact get a newline in it, the rest of the system now has predictable code paths. Q. Is this related to Perl? Who knows.

link