Hacker News
by ploxiln 2115 days ago
This feature doesn't help you at all. It's a historical mistake that macOS and Windows have preserved and refined a bit over the years.

The only real purpose of this is easier and faster compatibility with software developed and tested on macOS and Windows. Such software often has accidental case inconsistencies in the file name references in its code, which happen to work fine on macOS and Windows.

TFA could have made that argument, but it didn't; it made incorrect arguments instead. For example, it says a user might type a lower-case name for a file one time and a capitalized name another time, and intend to access the same file. But what user is typing in a whole filename the second time? They're picking it from a list or, if they're very advanced, using completion in a terminal. TFA also mentions non-English languages, in the context of Unicode normalization ... but non-Western-European languages won't be handled correctly by any universal case-folding algorithm anyway, with Turkish being the most common example.
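To make the Turkish point concrete, here is a minimal Python sketch. It assumes only the language's built-in, locale-independent `str.upper()` and `str.lower()`; the word chosen is just an illustration. Turkish has a dotted and a dotless "i", each with its own capital, so the default Unicode mappings do not round-trip.

```python
# Python's str.upper()/str.lower() apply the default Unicode case mappings,
# not the Turkish-specific ones.
name = "ışık"  # Turkish word spelled with dotless ı (U+0131)

upper = name.upper()        # dotless ı uppercases to plain ASCII I
round_trip = upper.lower()  # plain I lowercases to dotted i, not back to ı

print(upper)
print(round_trip)
print(round_trip == name)   # the round trip does not return the original

# Capital dotted İ (U+0130) lowercases to "i" plus a combining dot (U+0307),
# so even the number of code points changes.
dotted = "İ".lower()
print(len(dotted))
```

A Turkish-aware folding would get this word right, but it would then treat plain ASCII "I" and "i" differently from what an English user expects, which is why no single universal rule works.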

The whole strategy just doesn't work out well. Many low-level filesystem developers have known this for over 20 years. It doesn't work better for anyone, but non-technical people just aren't aware of why or how it increases complexity and costs, and reduces performance and reliability.

2 comments

How would one go about implementing case-insensitive path lookup against a SMB share containing a few million files from userspace?

It must be in the kernel. Else implementations of stuff with human-derived semantics should move out of the kernel, but moving SMB into userspace would of course be ridiculous.

So it complicates any mapping of filenames to data structures in the kernel (all 3 of them?), big deal. Every popular desktop operating system supports it, and basically the only reason Linux does not is a mixture of FUD and fear that we may need to update the code at some point due to changes in human culture, the horror!

Meanwhile, typing "cd music" in a terminal need not fail with "No such file or directory" when there is clearly a folder named "Music", as if the $3k worth of gear in front of me had the complexity of some 1950s B-movie sci-fi robot.

> How would one go about implementing case-insensitive path lookup against a SMB share containing a few million files from userspace?
>
> It must be in the kernel. Else implementations of stuff with human-derived semantics should move out of the kernel, but moving SMB into userspace would of course be ridiculous

Case-insensitive path lookups on SMB happen in the server (usually Samba), in user space. The client is also usually in user space, through FUSE or the client libraries, but CIFS of course exists as a kernel-mode alternative.

And as I have written elsewhere, sure, "music" vs. "Music" is simple when you live in an ASCII world. Trying to be smart with user input only causes trouble for the rest of us. Hiragana and katakana are also logically the same, and mixing them up is a similar typo on a kana keyboard. Simplified and traditional Chinese are also logically the same.
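As a rough illustration of what a user-space case-insensitive lookup amounts to, here is a minimal Python sketch. This is not how Samba implements it; `lookup_case_insensitive` is a hypothetical helper, and the temporary directory stands in for a share.

```python
import os
import tempfile


def lookup_case_insensitive(directory, wanted):
    """Return the on-disk names in `directory` that match `wanted`, ignoring case."""
    key = wanted.casefold()
    with os.scandir(directory) as entries:
        # Every entry has to be read and folded before it can be compared.
        return sorted(e.name for e in entries if e.name.casefold() == key)


# A throwaway directory standing in for the share.
with tempfile.TemporaryDirectory() as share:
    os.mkdir(os.path.join(share, "Music"))
    os.mkdir(os.path.join(share, "Documents"))
    matches = lookup_case_insensitive(share, "music")
    print(matches)
```

The whole thing is an ordinary directory scan plus a string comparison; nothing in it requires kernel privileges.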

> Every popular desktop operating system supports it

Doesn't mean we should break stuff here as well.

> How would one go about implementing case-insensitive path lookup against a SMB share containing a few million files from userspace? It must be in the kernel.

SMB in the kernel is a rather dangerous game. I'm not saying it can't be done (and I know Samsung is doing it :-), but it's a significantly more complex protocol than even NFSv4 (which really shouldn't be in the kernel either). For complexity's sake, userspace is easier IMHO (much easier to debug).

Also, you seem to be of the opinion that code in the kernel must be "magic", in that it can do things that user-space can't. The "missing case" lookup problem exists for kernel code just as it does for user-space code.

Looking up "foo" in all case variations in a directory containing a few million files still has to be done by search in the kernel, just as in user-space. It's going to be slow there too unless you provide a case-insensitive indexing mechanism.
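The difference between searching and indexing can be sketched in a few lines of Python. This is an illustration only: `find_by_scan` and `build_index` are hypothetical names, and the short list of names stands in for a directory's contents.

```python
from collections import defaultdict


def find_by_scan(names, wanted):
    """Linear search: fold and compare every name, O(n) per lookup."""
    key = wanted.casefold()
    return [n for n in names if n.casefold() == key]


def build_index(names):
    """Case-insensitive index: folded name -> all on-disk spellings."""
    index = defaultdict(list)
    for n in names:
        index[n.casefold()].append(n)
    return index


names = ["Foo", "bar", "FOO", "baz"]  # stand-in for a directory listing

scan_hits = find_by_scan(names, "foo")   # walks the whole list each time
index = build_index(names)               # one pass up front
index_hits = index.get("foo".casefold(), [])  # a single hash lookup afterwards

print(scan_hits)
print(index_hits)
```

The same trade-off applies whichever side of the kernel boundary the code runs on. Note that the index also has to decide what to do when two existing names fold to the same key, as "Foo" and "FOO" do here.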

Now the kernel offers easier opportunities to do things like directory content caching, which currently aren't exposed to user space via any API - but once you have to do something like directory content caching it's also possible to expose that feature to userspace via an API.

> Else implementations of stuff with human-derived semantics should move out of the kernel, but moving SMB into userspace would of course be ridiculous

Samba would beg to disagree of course :-).

Surely the mistake was case-sensitive filenames. I can't think of a single end-user use case where this behavior is desirable.

In a world of ASCII, maybe. But fixing one problem for a small group of people opens a giant can of worms for the rest of the world: normalization of compound characters, exotic character sets, emoji, different classes of upper- and lower-case letters, and so on.
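For the compound-character part of that list, a small Python sketch using the standard `unicodedata` module shows the problem. The strings are made up for illustration.

```python
import unicodedata

composed = "caf\u00e9"     # "é" as a single code point (U+00E9)
decomposed = "cafe\u0301"  # "e" followed by a combining acute accent (U+0301)

# Both render identically, but they are different code point sequences.
print(composed == decomposed)
print(len(composed), len(decomposed))

# They only compare equal after both are normalized to the same form.
nfc_equal = (
    unicodedata.normalize("NFC", composed)
    == unicodedata.normalize("NFC", decomposed)
)
print(nfc_equal)
```

A filesystem that wants these two names to be "the same file" has to pick a normalization form and apply it on every lookup, on top of whatever case folding it does.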

And even then it still doesn't fix the issue outside of the world of ASCII. The same filename written in hiragana or in katakana is logically the same to the end user, but the two spellings are still distinct. The same goes for simplified and traditional Chinese, Hangul and romaja, pinyin, Devanagari, Thai, and the list goes on.

Case-insensitive filenames fix nothing but break everything. There is only one sensible thing to do with arbitrary user-input, and that is to leave it be.
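A quick Python check of the kana case, assuming a folding made of NFKC normalization plus `str.casefold()` (that combination is my choice for illustration, not something any particular filesystem is known to use; `fold` is a hypothetical helper):

```python
import unicodedata


def fold(name):
    """Compatibility-normalize, then case-fold."""
    return unicodedata.normalize("NFKC", name).casefold()


hiragana = "さくら"
katakana = "サクラ"
halfwidth = "ｻｸﾗ"  # half-width katakana forms of the same word

# Hiragana and katakana stay distinct even after aggressive folding.
kana_same = fold(hiragana) == fold(katakana)
print(kana_same)

# NFKC does unify half-width and full-width katakana, but that is a
# compatibility mapping between forms of one script, not a kana conversion.
width_same = fold(halfwidth) == fold(katakana)
print(width_same)
```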

> There is only one sensible thing to do with arbitrary user-input, and that is to leave it be.

I wish programmers would believe this about names and addresses! My wife has a two-word first name, and I have a two-word family name.

Ensuring filenames don't get destroyed by an OS that refuses to understand a given language. Case is a complicated mess once you leave ASCII, and that's partially because ASCII is lying to you about how English case works: Yes, the English language has title case, and ASCII conflating it with upper-case does not negate that.

Move on to almost anywhere else and the notion that it's fast, reliable, and safe to convert case gets lost in the realities of human writing systems.
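Two small Python examples of those realities, using only the built-in string methods. The sample strings are my own.

```python
# German sharp s: uppercasing changes the length and cannot be undone.
street = "straße"
upper = street.upper()
back = upper.lower()
print(upper, back, back == street)

# Unicode has a third case, titlecase, distinct from uppercase.
# The digraph ǆ (U+01C6) has an uppercase Ǆ (U+01C4) and a titlecase ǅ (U+01C5).
digraph = "\u01c6"
digraph_upper = digraph.upper()
digraph_title = digraph.title()
print(digraph_upper, digraph_title, digraph_upper == digraph_title)
```

So a case conversion can change a name's length, lose information, and have more than two possible results, even before locale-specific rules come into play.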