Hacker News
by lultimouomo 752 days ago
Case insensitive matching is a surprisingly complicated, locale-dependent affair. Should I.txt and i.txt match? (Note that the first file is not named I.txt).

Case insensitive filesystems make about as much sense as ASCII-only filenames.

How would locale matter?
Off the top of my head, in Turkish, `i` doesn't become `I`, it becomes `İ`. And `ı` is the lowercase version of `I`.
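As a rough illustration of the Turkish mapping: Python's built-in `str.upper()` and `str.lower()` apply Unicode's default, locale-independent mappings, so the Turkish rule has to be spelled out by hand. This is only a sketch; `turkish_upper` and `turkish_lower` are hypothetical helper names, not standard library functions.

```python
# str.upper()/str.lower() use Unicode's default case mappings and ignore
# the process locale, so they never produce the Turkish results.
print("i".upper() == "I")   # True: default mapping is i -> I
print("ı".upper() == "I")   # True: dotless ı also uppercases to I
print(len("İ".lower()))     # 2: İ lowercases to "i" + U+0307 (combining dot)

# Hypothetical helpers (not stdlib) applying the Turkish rule by hand.
def turkish_upper(s: str) -> str:
    # Handle the two special letters first, then use the default mapping.
    return s.replace("i", "İ").replace("ı", "I").upper()

def turkish_lower(s: str) -> str:
    # Replace İ before lower() so it does not become "i" + U+0307.
    return s.replace("İ", "i").replace("I", "ı").lower()

print(turkish_upper("i.txt"))   # İ.TXT
print(turkish_lower("I.TXT"))   # ı.txt
```

Under the default rules `I.TXT` and `i.txt` are a case pair; under the Turkish rules they are not, which is the locale dependence in question.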
You don't need to decide how to upper or lower case a character to be insensitive to case, though. Treating them all as matching isn't a terrible option.
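One way to read "treating them all as matching" is an equivalence-class key: every variant of a letter maps to the same representative, and two names match when their keys are equal. A minimal sketch in Python, where `loose_key` and `loose_match` are hypothetical names and only the four "i" letters get special handling:

```python
import unicodedata

# Hypothetical table: the dotted and dotless "i" letters share one class,
# so no locale-specific upper/lower decision is ever made.
I_VARIANTS = str.maketrans({"I": "i", "İ": "i", "ı": "i"})

def loose_key(name: str) -> str:
    # NFC first, so "I" + U+0307 is composed into the single letter İ.
    composed = unicodedata.normalize("NFC", name)
    # Translate before casefold(): "İ".casefold() would give "i" + U+0307.
    return composed.translate(I_VARIANTS).casefold()

def loose_match(a: str, b: str) -> bool:
    return loose_key(a) == loose_key(b)

print(loose_match("I.txt", "i.txt"))   # True
print(loose_match("İ.txt", "ı.txt"))   # True
```

The cost is that letters a Turkish reader considers distinct can no longer coexist as separate names.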
For example, it depends on the locale if the capitalized form of ß is ß or SS.
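For reference, this is how Python's locale-independent Unicode mappings treat ß. It shows one possible behavior, not what any particular file system does:

```python
print("ß".upper())            # SS: the default full uppercase mapping
print("ẞ".lower())            # ß: U+1E9E is the capital sharp s
print("straße".casefold())    # strasse

# lower() alone does not make the two spellings compare equal; casefold() does.
print("STRASSE".lower() == "straße".lower())         # False
print("STRASSE".casefold() == "straße".casefold())   # True
```

So even without locales, a matcher has to choose between simple lowercasing and full case folding, and the two give different answers for names containing ß.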
And yet case insensitive file name matching / string matching is one of my favourite Windows features. It’s enormously convenient. An order of magnitude more convenient than the edge cases it causes me.

People aren’t ASCII or UTF-8 machines; “e” and “E” are the same character, that they are different ASCII codes is a behind the scenes implementation detail.

(That said, S3 isn’t a filesystem, it’s more like a web hashtable key-to-blob storage)

> People aren’t ASCII or UTF-8 machines; “e” and “E” are the same character

They are the same character to you, a native speaker of a Western language written in a Latin script. They are the same to you because you are, in fact, an ASCII machine. Many, many people in the world are not.

They are the same to me, they are different in ASCII, therefore I am not an ASCII machine. To me, the person using the computer to do work. Not the person wanting to do extra work to support the computer's internal leaky abstractions of data storage.

Your position, the position of too many people, is that I, a native speaker of English etc., should not be allowed to have a computer working how English works because somewhere, someone else is different. This is like saying I shouldn't be allowed an English spell checker because there are other people who speak other languages.

> “e” and “E” are the same character

They don't look like the same character to me. A character is a written symbol. These are different symbols.

What definition of "character" are you using where they're the same character?

I haven't ruled out that I am wrong, this is a naive comment.

Are the words hello and HELLO spelled differently? I am pretty squarely in the camp that filesystems should be case sensitive (perhaps with an insensitive shell on top), but I would not consider those two words as having a different spelling. To me that means they are the same sequence of characters.
You are confusing characters with glyphs. A glyph is a written symbol.
And you seem to be conflating characters and letters. There are fewer letters in the standard alphabet than we have characters for the same, largely because we do distinguish between some letter forms.

I suppose you could imagine a world where we don't, in fact, do this with just the character code. Seems fairly different from where we are, though?

I thought that if they're different glyphs they're different characters.

Surely the fact that they're represented differently in ASCII means ASCII regards them as different characters?

Whether they're different glyphs or not depends on the font.

When you press the "E" key on a US keyboard and "e" comes out, do you return the keyboard because it's broken? If not, then you know what definition I'm using even if I misnamed it.
> It’s enormously convenient. An order of magnitude more convenient than the edge cases it causes me.

Can you elaborate on this?

Every single time I type a path or filename (or server name) in the shell, or in Windows explorer, or in a file -> open or save dialog, I don't trip over capitalization. If I want to glob files with an 'ecks' in the name I can write *x* and not have to do it twice for *x* and *X*.

When I look at a directory listing and it has "XF86Config", I read it in my head as "ecks eff eight six config" not "caps X caps F num eight num six initial cap Config" and I can type what I read and don't have to double-check if it's config or Config.

Tab completion works if I type x<tab> instead of blanking on me and making me double check and type X<tab>.

Case sensitivity is like walking down a corridor and someone hitting you to a stop every few steps and saying "you're walking Left Right Left Right but you should be walking Right Left Right Left".

Case insensitivity is like walking down a corridor.

In PowerShell, some cmdlets are named like Add-VpnConnection where the initialism drops to lowercase after the first letter, others like Get-VMCheckpoint where the initialism stays capitalised, others mixed like Add-NetIPHttpsCertBinding where IP is caps but HTTPS isn't - any capitalisation works for running them or searching them with get-command or tab-completing them. I don't have to care. I don't have to memorise it, type it, pay attention to it, trip over it, I don't have to care!

"A programming language is low level when its programs require attention to the irrelevant." - Alan Perlis.

DNS names - ping GOOGLE.COM works, HTTPS://NEWS.YCOMBINATOR.COM works in a browser, MAC addresses are rendered with caps or lowercase hex on different devices, so are IPv6 addresses in hex format, email addresses - firstname.lastname or Firstname.Lastname is likely to work. File and directory access behaving the same means it's less bother. In Vim I :set ignorecase.

In PowerShell even string equality check is case insensitive by default, string match and split too. When I'm doing something like searching a log I want to see the English word 'error' if it's 'error' or 'ERROR' or 'Error' and I don't know what it is.

If I say the name of a document to a person I don't spell out the capitalisation. I don't want to have to do that to the computer, especially because there is almost no reason to have "Internal site 2 Network Diagram" and "INTERNAL site 2 network diagram" and "internal site 2 NETWORK DIAGRAM" in the same folder (and if there were, I couldn't easily keep them apart in my head).

All the time in command prompt shell, I press shift less often, type less, change directories and work with files more smoothly with less tripping over hurdles and being forced to stop and doublecheck what I'm tripping over when I read "word" and typed "word" and it didn't work.

On the other hand, the edge cases it causes me are ... well, I can't think of any because I don't want to put many files differing only by case in one directory. Maybe uncompressing an archive which has two files which clash? I can't remember that happening. Maybe moving a script to a case sensitive system? I don't do that often. In PowerShell, method calls are case insensitive. C# has "string".StartsWith() and JavaScript has .startsWith() and PowerShell will take .startswith() or .StartsWith or .Startswith or anything else. That occasionally clashes if there's a class with the same name in different case but that's rare, even.
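The archive edge case can be checked before extracting. A small sketch, assuming the member names are already available as a list of strings (for example from `zipfile.ZipFile.namelist()` or `tarfile.TarFile.getnames()`); `case_clashes` is a hypothetical helper and the listing is made up:

```python
from collections import defaultdict

def case_clashes(names):
    """Group names that differ only by case (compared via str.casefold)."""
    groups = defaultdict(list)
    for name in names:
        groups[name.casefold()].append(name)
    # Keep only the groups where two or more names collide.
    return [group for group in groups.values() if len(group) > 1]

# Hypothetical archive listing.
members = ["README", "docs/Makefile", "docs/makefile", "src/main.c"]
print(case_clashes(members))   # [['docs/Makefile', 'docs/makefile']]
```

An empty result means the archive should unpack the same way on case-sensitive and case-insensitive file systems, at least as far as `casefold()` agrees with the file system's own folding rules.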

In short, the computer pays attention to trivia so I don't have to. That's the right way round. It's about the best/simplest implementation of Do What I Mean (DWIM) that's almost always correct and almost never wrong.

> If I want to glob files with an 'ecks' in the name I can write *x* and not have to do it twice for *x* and *X*.

Adding

  shopt -s nocaseglob
to ~/.bashrc makes globbing case-insensitive in bash[1].

> Tab completion works if I type x<tab> instead of blanking on me and making me double check and type X<tab>.

Adding

  set completion-ignore-case on
to ~/.inputrc makes completion case-insensitive in bash (and other programs that use libreadline)[2].

Both options are independent of file system case-sensitivity.

[1] https://www.gnu.org/software/bash/manual/html_node/The-Shopt...

[2] https://tiswww.cwru.edu/php/chet/readline/readline.html#inde...
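The same idea, matching case-insensitively in the tool rather than in the file system, can be sketched outside bash. This Python version folds both the pattern and the candidate names before matching; `iglob_names` is a hypothetical helper and the name list is made up:

```python
import fnmatch

def iglob_names(names, pattern):
    """Return the names that match a glob pattern, ignoring case."""
    folded = pattern.casefold()
    # fnmatchcase() never applies platform case rules, so the result is
    # the same on every operating system and file system.
    return [n for n in names if fnmatch.fnmatchcase(n.casefold(), folded)]

names = ["XF86Config", "xorg.conf", "README", "Xresources"]
print(iglob_names(names, "*x*"))   # ['XF86Config', 'xorg.conf', 'Xresources']
```

The names come back in their stored spelling; only the comparison is folded.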

> Both options are independent of file system case-sensitivity.

In the Windows world it works everywhere, in any Win32 program (file open dialogs et al.). Here you have to have it built into every tool. (And Windows doesn't do it at the filesystem layer.)

None of these are the filesystem though, they are all abstractions over the file system that could easily implement case insensitivity, and as a sibling comment pointed out, actually do in many cases. I'm perfectly fine with the idea of interacting with files using a case insensitive interface. I just don't feel like it should be the job of the filesystem to enforce case insensitivity.
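A case-insensitive interface over a case-sensitive file system might look roughly like this: the layer lists the directory, compares folded names, and has to decide what to do when two entries differ only by case. A sketch only; `resolve_nocase` is a hypothetical function, and raising on ambiguity is one possible policy:

```python
import os

def resolve_nocase(directory, name):
    """Return the path of the entry in `directory` matching `name`, ignoring case."""
    wanted = name.casefold()
    matches = [e for e in os.listdir(directory) if e.casefold() == wanted]
    if not matches:
        raise FileNotFoundError(name)
    if len(matches) > 1:
        # Can only happen when the underlying file system is case-sensitive.
        raise LookupError(f"ambiguous name {name!r}: {sorted(matches)}")
    return os.path.join(directory, matches[0])
```

Typing `xf86config` would then open `XF86Config`, while the file system itself still stores and distinguishes the exact bytes.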
Complicated for whom? As a user, I have little pity for developers' and kernels' ease of life.