by james412 2115 days ago
This was discussed to death on the kernel mailing lists, you should go read them.

The principal question is whether the tool is more important, or the end user. Why did I pay for this machine if it weren't intended to facilitate me? That's the bottom line with most of these kinds of technical "correctness" arguments

And as for whether userspace should catch up, thanks to OS X for the most part that already happened a long time ago for a ton of open source packages

9 comments

This feature doesn't facilitate you at all. It's a historical mistake that macOS and Windows have preserved, and refined a bit over the years.

The only real purpose of this is for easier/faster compatibility with software developed and tested on macOS and Windows, which has accidental case inconsistency in file name references in the code, which happens to work fine on macOS and Windows.

TFA could have made that argument, but it didn't, it made incorrect arguments instead. For example, a user might type in a lower-case name for a file one time, and a capitalized name for a file another time, and intend to access the same file. But what user is typing in a whole filename the second time? They're picking it from a list, or if they're very advanced, using completion in a terminal. TFA also mentions non-English languages, in the context of unicode normalization ... but non-western-european languages won't be handled correctly by any universal case-folding algorithm anyway, with Turkish being the most common example.
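To make the Turkish point concrete, here is a minimal sketch. It assumes nothing beyond Python 3's built-in, locale-independent `str.lower()` and `str.upper()`; the variable names are just for illustration.

```python
# Turkish has four distinct letters: dotted İ/i and dotless I/ı.
# Python's str methods apply the locale-independent Unicode mappings,
# so they cannot produce what a Turkish reader expects.

dotless_lower = "\u0131"   # ı  LATIN SMALL LETTER DOTLESS I
dotted_upper = "\u0130"    # İ  LATIN CAPITAL LETTER I WITH DOT ABOVE

# Default rules: uppercase I lowercases to dotted i (Turkish expects ı).
print("I".lower() == "i")                                # True
# Dotless ı uppercases to plain I, so a round trip loses the distinction.
print(dotless_lower.upper() == "I")                      # True
print(dotless_lower.upper().lower() == dotless_lower)    # False
# İ lowercases to two code points: i + COMBINING DOT ABOVE (U+0307).
print(dotted_upper.lower() == "i\u0307")                 # True
print(len(dotted_upper.lower()))                         # 2
```

Whichever mapping a filesystem bakes in, one of the two groups of users gets the "wrong" answer.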

The whole strategy just doesn't work out well. Many low-level filesystem developers have known this for over 20 years. It doesn't work better for anyone, but non-technical people just aren't aware of why or how it increases complexity and costs, and reduces performance and reliability.

How would one go about implementing case-insensitive path lookup against a SMB share containing a few million files from userspace?

It must be in the kernel. Else implementations of stuff with human-derived semantics should move out of the kernel, but moving SMB into userspace would of course be ridiculous

So it complicates any mapping of filenames to data structures in the kernel (all 3 of them?), big deal. Every popular desktop operating system supports it, and basically the only reason Linux does not is a mixture of FUD and fear that we may need to update the code at some point due to changes in human culture, the horror!

Meanwhile, typing "cd music" in a terminal need not print "No such file or directory" when there is clearly a folder named "Music", as if the $3k worth of gear in front of me had the complexity of some 1950s B-movie sci-fi robot

> How would one go about implementing case-insensitive path lookup against a SMB share containing a few million files from userspace?
>
> It must be in the kernel. Else implementations of stuff with human-derived semantics should move out of the kernel, but moving SMB into userspace would of course be ridiculous

Case-insensitive path lookups on SMB happen in the server (usually Samba), in user-space. The client is also usually in user-space through FUSE or the client libs, but CIFS of course exists as a kernel-mode alternative.

And as I have written elsewhere, sure, "music" vs. "Music" is simple when you live in an ASCII world. Trying to be smart with user input only causes trouble for the rest of us. Hiragana and katakana are also logically the same, and on a kana keyboard a similar typo. Simplified and traditional Chinese are also logically the same.

> Every popular desktop operating system supports it

Doesn't mean we should break stuff here as well.

> How would one go about implementing case-insensitive path lookup against a SMB share containing a few million files from userspace?
>
> It must be in the kernel.

SMB in the kernel is a rather dangerous game. Not saying it can't be done (and I know Samsung is doing it :-), but it's a significantly more complex protocol than even NFSv4 (which really shouldn't be in the kernel either). For complexity's sake, userspace is easier IMHO (much easier to debug).

Also, you seem to be of the opinion that code in the kernel must be "magic", in that it can do things that user-space can't. The "missing case" lookup problem still exists for kernel code as it does for user-space code.

Looking up "foo" in all case variations in a directory containing a few million files still has to be done by search in the kernel as it does in user-space. It's going to be slow there too unless you provide a case-insensitive indexing mechanism.
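A rough sketch of what such a case-insensitive index could look like, in Python for brevity. The helper names `build_folded_index` and `lookup` are made up for illustration, and `str.casefold()` stands in for whatever folding table a real filesystem would use.

```python
from collections import defaultdict

def build_folded_index(names):
    """Map each case-folded name to the list of real names that fold to it."""
    index = defaultdict(list)
    for name in names:
        index[name.casefold()].append(name)
    return index

def lookup(index, wanted):
    """Return every stored name matching `wanted` case-insensitively."""
    return index.get(wanted.casefold(), [])

names = ["Music", "README", "readme", "notes.txt"]
index = build_folded_index(names)
print(lookup(index, "music"))    # ['Music']
print(lookup(index, "ReadMe"))   # ['README', 'readme']
print(lookup(index, "missing"))  # []
```

With the index in place a miss costs one hash lookup instead of a full scan, but the index has to be kept in sync with every create, rename, and delete, wherever it lives.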

Now the kernel offers easier opportunities to do things like directory content caching, which currently aren't exposed to user space via any API - but once you have to do something like directory content caching it's also possible to expose that feature to userspace via an API.

> Else implementations of stuff with human-derived semantics should move out of the kernel, but moving SMB into userspace would of course be ridiculous

Samba would beg to disagree of course :-).

Surely the mistake was case sensitive filenames. I can't think of a single end-user use case where this behavior is desirable.
In a world of ASCII, maybe. But fixing one problem for a small group of people is a giant can of worms for the rest of the world. Normalization of compound characters, exotic character sets, emoji, different classes of upper/lower-case letters, normalization of compound what-not.

And even then it still doesn't fix the issue outside of the world of ASCII. A filename written in hiragana and katakana is logically the same to the end-user, but they are still distinct. Simplified and traditional Chinese, Hangul and romaja, pinyin, Devanagari, Thai, and the list goes on.

Case-insensitive filenames fix nothing, but break everything. There is only one sensible thing to do with arbitrary user-input, and that is to leave it be.
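A small illustration of both points, assuming only Python 3's standard `unicodedata` module. The helpers `nfc` and `fold` are hypothetical names; `fold` is just one plausible "normalize, then case-fold" recipe, not what any particular filesystem does.

```python
import unicodedata

def nfc(text):
    """Compose characters into canonical (NFC) form."""
    return unicodedata.normalize("NFC", text)

def fold(text):
    """Compatibility-normalize, then apply Unicode case folding."""
    return unicodedata.normalize("NFKC", text).casefold()

# "é" as one precomposed code point vs. "e" + combining acute accent.
composed = "caf\u00e9"
decomposed = "cafe\u0301"
print(composed == decomposed)                                  # False
print(composed.encode("utf-8") == decomposed.encode("utf-8"))  # False
print(nfc(composed) == nfc(decomposed))                        # True

# The same two syllables ("kana") in hiragana and in katakana.
hiragana = "\u304b\u306a"   # かな
katakana = "\u30ab\u30ca"   # カナ
print(fold(hiragana) == fold(katakana))                        # False
```

Normalization handles the compound-character case, but neither normalization nor case folding treats the two kana spellings as the same name.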

> There is only one sensible thing to do with arbitrary user-input, and that is to leave it be.

I wish programmers would believe this about names and addresses! My wife has a two-word first name, and I have a two-word family name.

Ensuring filenames don't get destroyed by an OS that refuses to understand a given language. Case is a complicated mess once you leave ASCII, and that's partially because ASCII is lying to you about how English case works: Yes, the English language has title case, and ASCII conflating it with upper-case does not negate that.

Move on to most anywhere else and the notion that it's fast, reliable, and safe to convert case gets lost in the realities of human writing systems.

> Why did I pay for this machine if it weren't intended to facilitate me?

I happen to agree with the idea that the filename should be a dumb blob of bytes and the kernel should not do case folding, as it is the wrong layer for that, eg. the user can change their language but it won't update what has been written to the disk in thousands or millions of places where you could suddenly have a filename collision somewhere based on those rules changing.
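Here is a hedged sketch of that collision problem. `default_fold`, `turkish_fold`, and `collisions` are hypothetical helpers, and the Turkish rule is hand-rolled for illustration rather than taken from any locale library.

```python
from collections import defaultdict

def default_fold(name):
    """Locale-independent lowercasing (I -> i)."""
    return name.lower()

def turkish_fold(name):
    """Hand-rolled Turkish-style lowercasing (I -> ı, İ -> i). Illustrative only."""
    return name.replace("I", "\u0131").replace("\u0130", "i").lower()

def collisions(names, fold):
    """Return sorted groups of names that fold to the same key."""
    groups = defaultdict(list)
    for name in names:
        groups[fold(name)].append(name)
    return sorted(sorted(group) for group in groups.values() if len(group) > 1)

on_disk = ["TITLE", "title", "t\u0131tle"]   # the last one is "tıtle"
print(collisions(on_disk, default_fold))     # [['TITLE', 'title']]
print(collisions(on_disk, turkish_fold))     # [['TITLE', 'tıtle']]
```

The same three names on disk collide in different pairs depending on which rule is active, so switching rules after the fact can turn previously distinct files into duplicates.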

But, I do hope you get that refund for your Linux.

That's IIRC how it used to be treated on ext based file systems until now. Everything allowed except for / and NUL bytes.
> dumb blob of bytes

Well, now your filename is invalid utf8. How should programs display it or even address such a file?

How does the UI framework act when you set a label to such payload? How does your web browser act when it sees it in HTML? I have found working on apps that see a lot of usage in varied markets that as much as we wish to see the best and ideal conditions, malformed utf-8 surfaces in the real world pretty often.
> How should programs display it

what's wrong with foo����.txt

> or even address such a file?

... by using the array of bytes?

The fact that if one has two files, say “test{invalid bytes}.txt” and “test{other invalid bytes}.txt”, both have replacement characters inserted at the same spot and would decode to the same codepoints.
It's ambiguous, for example.
so are a file named Hello.txt and another one named Нello.txt
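Both sides of this exchange can be shown in a few lines of Python. The byte strings are made-up examples; the `surrogateescape` error handler is, as far as I know, what `os.fsdecode()` uses for filenames on POSIX systems.

```python
# Two distinct byte-string filenames, neither of them valid UTF-8.
name_a = b"test\xff.txt"
name_b = b"test\xfe.txt"

# Lossy display: each bad byte becomes U+FFFD, so the names look identical.
shown_a = name_a.decode("utf-8", errors="replace")
shown_b = name_b.decode("utf-8", errors="replace")
print(shown_a == shown_b)                                          # True

# Lossless handling: surrogateescape keeps the bad bytes recoverable.
kept_a = name_a.decode("utf-8", errors="surrogateescape")
kept_b = name_b.decode("utf-8", errors="surrogateescape")
print(kept_a == kept_b)                                            # False
print(kept_a.encode("utf-8", errors="surrogateescape") == name_a)  # True

# Perfectly valid Unicode can be just as confusing to the eye:
latin = "Hello.txt"            # Latin capital H
cyrillic = "\u041dello.txt"    # Cyrillic capital Н (En)
print(latin == cyrillic)                                           # False
```

So the display string is ambiguous, but a program that keeps the original bytes (or a lossless encoding of them) can still address each file.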
> Well, now your filename is invalid utf8.

That's reality. An OS which can't keep up with reality is broken.

I understand that NTFS has its own case folding table which is written once when the volume is formatted. This does seem to have stood the test of time and enormous usage so maybe it is not such a poor idea.
That doesn't sound great if somebody formats your USB stick in Turkey and suddenly speakers of western European languages can observe 'i' as case sensitive.
I believe you’ve just shed light on a twelve year old bug in creating bootable USBs under Windows XP.
> filename should be a dumb blob of bytes

This hasn't been true since the days of CP/M

For e.g. the Linux kernel, besides path separator(s), why do you think that?

All of the wide/special-case manipulation when writing code on e.g. Windows drove me nuts.

Out of interest, what special-case manipulation? I generally treat file paths as opaque `\` separated strings (or even as a single blob if I don't need to parse it). I'm uncertain why I'd want to treat them specially. I'll leave that to the OS.
Take for example a file on NTFS. The filenames can be UTF-16 (they support 16-bit chars under the hood), but they also might not be valid UTF-16. When you access the file by the filename, you now have the potential to use all of the wide function calls (e.g. _wfopen) or the ANSI versions (e.g. fopen).
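To illustrate "16-bit units that are not valid UTF-16", here is a small Python sketch. It does not touch NTFS at all; it only shows a name containing a lone surrogate, which is my assumption about the kind of name meant here.

```python
# A lone high surrogate is a legal 16-bit unit but not valid UTF-16 text.
lone = "\ud800"
name = "report" + lone + ".txt"

try:
    name.encode("utf-16-le")           # strict encoding refuses it
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False
print(strict_ok)                       # False

# "surrogatepass" writes the raw 16-bit unit, preserving the name as-is.
raw = name.encode("utf-16-le", errors="surrogatepass")
round_trip = raw.decode("utf-16-le", errors="surrogatepass")
print(round_trip == name)              # True
```

Code that insists on well-formed UTF-16 cannot represent such a name, while code that treats it as an opaque sequence of 16-bit units can.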
> That's the bottom line with most of these kinds of technical "correctness" arguments

The problem is emergent behavior. We can create any number of features, but we have a really hard time testing all the available configurations that result. Engineers rely on simplicity as a way of warding off this particular problem, because the other bottom line is people don't want to pay for a system that loses or invalidates the work they've put into it.

I disagree. The error is to consider paths as high-level information that the user has to know about, rather than what they really are: low-level information that the user potentially never sees (for example, consider mobile operating systems like Android/iOS).

In practice, case insensitivity, if we want to call it that, should be implemented at a higher level, in the file manager, rather than in the filesystem itself. That is even what newer versions of Windows/NTFS do! Recent versions of NTFS are in fact case sensitive, and if you mount an NTFS volume with Linux you can create two files whose names differ only by case: the whole case-insensitive thing is handled at a higher level in the Windows APIs.

Arguably the biggest difference that I notice when using linux as a desktop environment as an end-user is that it trusts "you", the [sometimes root] user, to a far greater extent than other operating systems. It is for this reason that I enjoy it.

It also means that if you want to do something highly annoying and unusual, you can – and arguably case-insensitive filesystems are a subset of that.

I agree that case-insensitive filesystems should be an esoteric feature, but given that they’re the default on Windows and macOS, it should definitely be a well-supported option on Linux for the sake of compatibility.
I've set the Samba shares on my NAS to be case-sensitive, as making them case-insensitive slows down directory access by orders of magnitude.

I've been running this for years accessing them both from Linux desktops and Windows desktops, and only once have I had an issue that required me to manually rename something on the NAS.

This makes sense as most applications don't care about the filename, and will just use what you supply, or generate one and use that string all over.

Yep, that's true. It's the cache misses that kill performance. If the client asks for file "Foo", and the (l)stat fails to find it, then we have to scan the whole directory looking for any case-differing versions of "foo" "FOO" "fOo" etc.

Very costly, but the only way to give case-insensitivity.

> Very costly, but the only way to give case-insensitivity.

That is very costly — but it's certainly not the only way to provide case-insensitivity, nor is it the recommended way, and I'd be surprised if any implementation of case-insensitivity actually did what you say.

Normally one would lowercase (or uppercase) both strings, and then do the comparison.

The complexity here usually comes when case-folding various tricky locales.

You are incorrect. It is the only way for a user-space application to provide Windows-style case insensitivity.

Windows client sends a name "foo" over SMB2+. Samba does a stat("foo", &st) call, gets ENOENT.

Now, does that name really not exist ? Or is it there in the requested directory under any of the names "FOO"/"FOo"/"Foo"/"fOo"/"fOO"/"FoO" etc. ?

The only way for an application to tell is to do:

opendir(); loop over fname = readdir(), comparing the lowercased client_provided_name against the lowercased fname; closedir().

If the file really doesn't exist you have ended up scanning the whole directory.

Because Linux doesn't provide directory leasing you have to do this for every missed lookup (another process might have created it in the meantime).

There is no POSIX API for "does this file exist under another cased name?" If there were, Samba would use it.
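The stat-then-scan sequence described above, sketched in Python rather than C. This is not Samba's code: `find_case_insensitive` is a hypothetical helper, and `str.casefold()` stands in for Windows-style case matching.

```python
import os
import tempfile

def find_case_insensitive(directory, wanted):
    """Return the on-disk name matching `wanted`, or None if absent.

    Fast path: one lstat on the exact name. Slow path: scan the whole
    directory comparing case-folded names.
    """
    try:
        os.lstat(os.path.join(directory, wanted))
        return wanted                      # exact name resolved
    except FileNotFoundError:
        pass
    target = wanted.casefold()
    with os.scandir(directory) as entries:
        for entry in entries:              # O(n) in the directory size
            if entry.name.casefold() == target:
                return entry.name
    return None                            # scanned everything, no match

with tempfile.TemporaryDirectory() as d:
    open(os.path.join(d, "Foo"), "w").close()
    print(find_case_insensitive(d, "Foo"))   # Foo
    print(find_case_insensitive(d, "fOO"))   # Foo (on a case-sensitive filesystem)
    print(find_case_insensitive(d, "bar"))   # None
```

The expensive case is the last one: a name that does not exist under any spelling forces a read of every entry, and without some form of directory lease the result cannot be cached safely.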

Which has caused countless problems with interpreted languages and cross-platform functionality. Things like Ruby on OS X will gladly let you mangle the case of your include strings, which causes a "works fine in dev" problem. I'm not a fan of case-insensitive filesystems because I have to manage services.
It's not enough to have similar matching at the fs level, applications need to have the same functionality around, otherwise anything that indexes the fs will have false negatives before reading the files.

Now, if this needs to be taken care of at the application level, then why have this misfeature? It'd be better to have a good library for matching this that could be aware of the language and locale (or maybe multiple lang,locale pairs) instead of throwing this feature into the fs and call it a day.

Also, if some applications benefit from having insensitive matching, like things built to run on Windows/Mac, then having a wrapper that fixed the fs access with this matching library would be enough, no need to force other applications to use insensitive matching because a single one needs it.

> And as for whether userspace should catch up, thanks to OS X for the most part that already happened a long time ago for a ton of open source packages

That’s a bold claim. Open source developer here (on the fish team): given a user-entered path, how do you make it canonical?

Specifically, without recursively opening directories and listing contents, how can you convert /foo/baR to the correct case as it appears in the filesystem index? Without following any symlinks, should any of the path components be a symlink?
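For reference, this is roughly what the directory-listing approach looks like, i.e. the very thing the question hopes to avoid. It is a sketch under assumptions: `canonical_case` is a hypothetical helper, paths are `/`-separated and relative to a known root, and `str.casefold()` approximates the filesystem's matching rule.

```python
import os
import tempfile

def canonical_case(root, relative_path):
    """Return `relative_path` respelled with the case stored under `root`.

    Every parent directory gets listed along the way. Symlink components
    keep their own names; they are not replaced by their targets.
    Returns None when a component has no case-insensitive match.
    """
    current = root
    fixed_parts = []
    for part in filter(None, relative_path.split("/")):
        try:
            names = os.listdir(current)
        except OSError:
            return None                       # missing or not a directory
        if part in names:
            match = part                      # exact spelling exists
        else:
            candidates = [n for n in names if n.casefold() == part.casefold()]
            if not candidates:
                return None
            match = sorted(candidates)[0]     # arbitrary pick if ambiguous
        fixed_parts.append(match)
        current = os.path.join(current, match)
    return "/".join(fixed_parts)

with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "foo"))
    open(os.path.join(root, "foo", "baR"), "w").close()
    print(canonical_case(root, "FOO/bar"))    # foo/baR
    print(canonical_case(root, "foo/nope"))   # None
```

It costs one directory listing per path component, and it is racy: another process can rename an entry between the listing and whatever the caller does next.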

Could you provide pointers to these discussions? As a technical person, case sensitive filesystems have been a loss both from a programmer's and also from an end user's perspective.
Here's Torvalds' view on the matter: https://lwn.net/ml/linux-fsdevel/CAHk-=wg2JvjXfdZ8K5Tv3vm6+b...

I also side with this view, namely that this is something that would be better placed in the userspace rather than the kernel, which really doesn't need more complexity for things that add so little value (negative value to some).

He must have changed his view because he ultimately allowed the feature. Does anyone know what changed his mind?
While Linus kicks off about things, he doesn't tend to outright refuse things. I don't think he sees himself as the gatekeeper of the kernel, and that is evident in the way the kernel developed right from the beginning. That has attracted criticism from the likes of Ken Thompson, who thought that too much crappy code was allowed into Linux.

> I've looked at the source and there are pieces that are good and pieces that are not. A whole bunch of random people have contributed to this source, and the quality varies drastically