Hacker News
by capitol_ 750 days ago
A case-conversion algorithm needs a locale as input in order to apply the correct conversion rules.

The most common example is probably that i (U+0069 LATIN SMALL LETTER I) and I (U+0049 LATIN CAPITAL LETTER I) transform into each other in most locales, but not all. In locales az and tr (the Turkic languages), i uppercases to İ (U+0130 LATIN CAPITAL LETTER I WITH DOT ABOVE), and I lowercases to ı (U+0131 LATIN SMALL LETTER DOTLESS I).
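
To make the i/I example concrete, here is a minimal Python sketch. Python's built-in str.upper() and str.lower() take no locale, so the helper names (upper_for_locale, lower_for_locale) and the TURKIC_LOCALES set are hypothetical, made up for illustration. It only covers the two dotted/dotless mappings mentioned above and ignores the other locale-specific rules Unicode defines (for example, handling of a combining dot above), so treat it as a sketch and not a real implementation.

```python
# Hypothetical helpers: not a standard API, just an illustration of
# why case conversion needs a locale argument.
TURKIC_LOCALES = {"az", "tr"}


def upper_for_locale(text: str, locale: str) -> str:
    if locale in TURKIC_LOCALES:
        # In az/tr, i uppercases to U+0130 (capital I with dot above).
        # Do this before the default mapping turns i into I.
        text = text.replace("i", "\u0130")
    # Default, locale-independent mapping for everything else.
    return text.upper()


def lower_for_locale(text: str, locale: str) -> str:
    if locale in TURKIC_LOCALES:
        # In az/tr, I lowercases to U+0131 (dotless i),
        # and U+0130 lowercases to a plain i.
        text = text.replace("I", "\u0131").replace("\u0130", "i")
    return text.lower()


print(upper_for_locale("i", "en"))  # I
print(upper_for_locale("i", "tr"))  # İ
print(lower_for_locale("I", "tr"))  # ı
```

The point is that the same input string gives different correct answers depending on a parameter that a file system usually does not have.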

Case insensitivity is all fine if you only handle text that consists of A-Za-z, but as soon as you want to write software that works for all languages, it becomes a mess.

2 comments

This is the main point, and almost all the other chatter is not particularly relevant. A dumb computer and a human can agree on "files are case sensitive, and sometimes that's a bit weird, but computers are weird sometimes". If there were indeed exactly one universal way to do case insensitivity, it would be OK. Case-insensitive file systems date from a time when there was: everything was English, and case folding in English is easy. Problem solved. But that doesn't work today. Having multiple case-folding rules is essentially as unsolvable a problem as the ones that arise from case sensitivity, except that the resulting problems are harder for humans to understand, programmers included.

Simple and wrong is better than complicated and wrong, especially when the wrongness stays swept under the carpet until the day it doesn't.

Though you still ought to declare a Unicode normalization form for the file system, which would be perfectly fine if it weren't for backwards compatibility.
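
For anyone wondering why normalization matters for file names, a small Python sketch using the standard unicodedata module. The same_name helper and the choice of NFC as the default form are assumptions for illustration, not something any particular file system is claimed to do.

```python
import unicodedata

composed = "caf\u00e9"     # é as a single code point (U+00E9)
decomposed = "cafe\u0301"  # e followed by a combining acute accent (U+0301)


def same_name(a: str, b: str, form: str = "NFC") -> bool:
    # Hypothetical helper: compare two names after normalizing both
    # to the same Unicode normalization form.
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


# The two strings render identically but are different code point sequences.
print(composed == decomposed)           # False
print(same_name(composed, decomposed))  # True
```

Without a declared normalization form, both strings could exist as distinct file names that look the same to a human.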

Minor nitpick: case-insensitive comparison is a separate problem from case conversion, and IIRC a little simpler. Still locale-specific.
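
A rough Python illustration of the difference, using str.casefold(), which applies Unicode's default (locale-independent) case folding. The equal_ignoring_case helper is a made-up name for this sketch.

```python
def equal_ignoring_case(a: str, b: str) -> bool:
    # Hypothetical helper: compare using default Unicode case folding,
    # which is meant for matching, not for display.
    return a.casefold() == b.casefold()


# Folding handles cases that simple lowercasing misses:
print("Stra\u00dfe".lower() == "strasse")             # False (ß stays ß)
print(equal_ignoring_case("Stra\u00dfe", "STRASSE"))  # True

# Default folding is not locale-aware: I folds to i, so it does not
# match dotless ı (U+0131), even though a Turkish reader would
# consider them the same letter in different cases.
print(equal_ignoring_case("I", "\u0131"))             # False
```

So comparison avoids having to pick an output case, but the default folding still bakes in one set of rules that is wrong for az and tr.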