| > ... because back then Unicode was still a pipe dream in the mind of some visionary. Everything used 8 bit encoding, and everyone assumed ASCII or at least something compatible with the 7 bit subset of ASCII. They could still have been case sensitive (C was), so I don't understand how that's relevant to the idea that "case insensitivity is a problem". > Rust, Go, Swift, Python All of these languages impose ASCII for their keywords and directives. They allow you to use other characters for identifiers, but impose ASCII in everything that has pre-defined semantics. Original APL is the only "real"/practical language that I'm aware of that gave up the "western centric view" of the world to the point that it doesn't have a single English keyword. (Brainfuck, etc. exist as well, but ....) And they all impose a left-to-right reading order, which is just as western-centric. Arabic/Farsi/Hebrew go right-to-left, and there are languages that can also go top-to-bottom. I think the outrage about "western centrism" is misguided. This is a formal system, and just like math, it reflects some history by using Latin letters and left-to-right for the predefined symbols, and even preferred use of Latin characters in identifiers. > Nim itself supports UTF-8 "letters" in identifiers, and what is "upper case" or "lower case" 100% depends from the current locale. If that's true, that may be a problem. I'll look into that, thanks for pointing it out - from memory, Nim only folds the lower 7-bit by a 32 difference in ASCII code, so it is well defined regardless of locale, but I'll check. The whole idea of UTF-8 in identifiers is a minefield, whether you fold case or not; e.g: "Εхаmрⅼе" and "Example" have no single letter in common (I chose them that way using [0]) and no language that allows UTF-8 identifiers is going to warn you about that. [0] https://www.irongeek.com/-attack-generator.php |
The point here is that case insensitivity is only a viable option if you severely limit the encoding allowed in whatever you are using - be it a programming language, a filesystem, etc. If the encoding of your files is basically akin to ASCII or ISO-whatever (which is what BASIC and Pascal used back in the day), then case insensitivity is trivial and safe.
This whole thing breaks apart as soon as you enter a Unicode world and start accepting identifiers containing more than ASCII, and then the whole concept of "case insensitive" becomes obsolete and outright wrong.
The closest Unicode equivalent of "case insensitive" is case folding on top of normalization [0], and it's one heck of a minefield because case mapping depends on the locale in use. For instance, "FILE.TXT" and "file.txt" are to be considered equivalent under en_US, but not under tr_TR, where the lower case version of "FILE.TXT" is "fıle.txt" and the upper case version of "file.txt" is "FİLE.TXT". This means that case-folding strings can lead to unexpected results depending on the locale, which is especially problematic with filesystems (where a path may or may not exist depending on the locale).
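To make the tr_TR case concrete, here is a rough Python sketch. Python's own `str.lower()` and `str.upper()` ignore the locale, so `lower_tr` and `upper_tr` below are hypothetical helpers I made up that only special-case the dotted/dotless i pairs; a real implementation would use a locale-aware library.

```python
# Hypothetical helpers, for illustration only: they special-case just the
# Turkish dotted/dotless i pairs and defer everything else to str.lower/upper.
def lower_tr(s: str) -> str:
    # "I" -> dotless "ı" (U+0131), dotted "İ" (U+0130) -> "i"
    return s.replace("I", "\u0131").replace("\u0130", "i").lower()


def upper_tr(s: str) -> str:
    # "i" -> dotted "İ" (U+0130), dotless "ı" (U+0131) -> "I"
    return s.replace("i", "\u0130").replace("\u0131", "I").upper()


upper_name, lower_name = "FILE.TXT", "file.txt"

# Locale-independent mapping (roughly what en_US expects): the names match.
match_default = upper_name.lower() == lower_name.lower()

# Turkish-style mapping: "I" lowers to dotless "ı", so the names differ.
match_turkish = lower_tr(upper_name) == lower_tr(lower_name)

print(match_default, match_turkish)
print(lower_tr(upper_name), upper_tr(lower_name))
```

The same two names compare equal or not depending on which mapping you pick, which is exactly the filesystem problem.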
> Nim only folds the lower 7-bit by a 32 difference in ascii code, so it is well defined regardless of locale
Yes, it is well defined, but allowing the entirety of the Unicode letters also means that identifiers may contain letters from alphabets that have separate cases, chiefly Greek and Cyrillic, or even accented letters such as `è` or `ö`. ASCII-only case insensitivity instead of proper normalization makes them potentially confusing, and pretty much defeats the intent behind allowing Unicode identifiers by making non-US locales second-class citizens.
IMHO it is very confusing to non-English speakers that 'mela' is equivalent to 'MELA' but 'tè' isn't equivalent to 'TÈ' while 'Tè' is. It basically means you have to remember which letters are ASCII and which are not, which makes the whole "case insensitive" idea a potential source of confusion.
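A quick Python sketch of what an ASCII-only fold does to those three pairs. `ascii_fold` and `same_ident` are hypothetical names, and this is only the "fold A-Z by a difference of 32" rule described in the quote, not Nim's actual identifier comparison.

```python
def ascii_fold(ident: str) -> str:
    # Fold only A-Z to a-z by adding 32 to the code point;
    # every other character (including accented letters) is left untouched.
    return "".join(chr(ord(c) + 32) if "A" <= c <= "Z" else c for c in ident)


def same_ident(a: str, b: str) -> bool:
    # Two identifiers are "the same" if they match after the ASCII-only fold.
    return ascii_fold(a) == ascii_fold(b)


# Precomposed accented letters are written as escapes to avoid ambiguity:
# \u00e8 is "è", \u00c8 is "È".
pairs = [
    ("mela", "MELA"),          # pure ASCII: folds to the same string
    ("t\u00e8", "T\u00c8"),    # "È" is not ASCII, so it is never folded
    ("t\u00e8", "T\u00e8"),    # only the ASCII "T" differs, so this matches
]

results = [same_ident(a, b) for a, b in pairs]
print(results)
```

Whether two spellings collide depends purely on whether the differing letters happen to be in the ASCII range.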
I think it is safe to say that in 2021 case insensitivity is an obsolete concept and an obstacle to proper internationalization. Case insensitivity only really works on legacy encodings and with the basic Latin alphabet, and you can rest assured it will be almost always improperly implemented anyway.
[0] https://en.wikipedia.org/wiki/Unicode_equivalence