HN Mirror


by qalmakka 1643 days ago

> They could still have been case sensitive (C was), so I don't understand how that's relevant to the idea that "case insensitivity is a problem".

The point here is that case insensitivity is only a viable option if you severely limit the encoding allowed in whatever you are using, be it a programming language, a filesystem, etc. If the encoding of your files is basically akin to ASCII or ISO-whatever (which is what BASIC and Pascal used back in the day), then case insensitivity is trivial and safe.

This whole thing breaks apart as soon as you enter a Unicode world and start accepting identifiers containing more than ASCII, and then the whole concept of "case insensitive" becomes obsolete and outright wrong.

The Unicode equivalent of "case insensitive" is Normalization [0], and it's a heck of a minefield because case mapping depends on the locale in use. For instance, "FILE.TXT" and "file.txt" are to be considered equivalent under en_US, but not under tr_TR, where the lower case version of "FILE.TXT" is "fıle.txt" and the upper case version of "file.txt" is "FİLE.TXT". This means that normalizing strings can lead to unexpected results depending on the locale, which is especially problematic with filesystems (where a path may or may not exist depending on the locale).
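To make the tr_TR example concrete, here is a minimal Python sketch. Python's `str.lower()` and `str.upper()` do not look at the locale, so `turkish_lower` and `turkish_upper` are hypothetical helpers written for this illustration that special-case only the dotted/dotless i pairs; they are not a complete Turkish casing implementation.

```python
def turkish_lower(s: str) -> str:
    # Hypothetical helper: map the two Turkish-specific pairs first
    # (I -> ı, İ -> i), then fall back to the default lowercase rules.
    return s.replace("I", "\u0131").replace("\u0130", "i").lower()


def turkish_upper(s: str) -> str:
    # Hypothetical helper: map i -> İ and ı -> I first,
    # then fall back to the default uppercase rules.
    return s.replace("i", "\u0130").replace("\u0131", "I").upper()


name = "FILE.TXT"
default_lower = name.lower()        # locale-independent default mapping
tr_lower = turkish_lower(name)      # dotless ı in place of i
tr_upper = turkish_upper("file.txt")  # dotted İ in place of I

# The two lowercase forms differ, so a case-insensitive lookup of the
# same path can succeed under one set of rules and fail under the other.
print(default_lower, tr_lower, tr_upper)
print(default_lower == tr_lower)
```

A filesystem that folds names with one set of rules and a client that folds them with the other would disagree about whether the path exists.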

> Nim only folds the lower 7-bit by a 32 difference in ascii code, so it is well defined regardless of locale

Yes, it is well defined, but allowing the entirety of the Unicode letters also means that identifiers may contain glyphs from alphabets that have separate cases, chiefly Greek and Cyrillic, or even accented letters such as `è` or `ö`. Case insensitivity instead of proper normalization makes them potentially confusing, and rather defeats the intent behind allowing Unicode identifiers by making non-US locales second class citizens.

IMHO it is arguably very confusing to non-English speakers that 'mela' is equivalent to 'MELA' but 'tè' isn't equivalent to 'TÈ' while 'Tè' is. It basically means you have to remember what letters are ASCII and what are not, which makes the whole "case insensitive" a potential source of confusion.
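A rough Python sketch of the behaviour described above, assuming the folding is nothing more than "add 32 to A-Z". `ascii_fold` is a hypothetical name for this illustration, and it deliberately ignores any other identifier rules Nim applies.

```python
def ascii_fold(ident: str) -> str:
    # Fold only the 26 ASCII capitals by adding 32 to their code point;
    # every other character, including accented letters, is left as-is.
    return "".join(chr(ord(c) + 32) if "A" <= c <= "Z" else c for c in ident)


# Pure ASCII identifiers compare equal after folding.
same_ascii = ascii_fold("MELA") == ascii_fold("mela")

# 'TÈ' vs 'tè': the T folds, but È (U+00C8) does not become è (U+00E8).
same_accented_upper = ascii_fold("T\u00c8") == ascii_fold("t\u00e8")

# 'Tè' vs 'tè': only the ASCII T differs, so these do compare equal.
same_accented_mixed = ascii_fold("T\u00e8") == ascii_fold("t\u00e8")

print(same_ascii, same_accented_upper, same_accented_mixed)
```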

I think it is safe to say that in 2021 case insensitivity is an obsolete concept and an obstacle to proper internationalization. Case insensitivity only really works on legacy encodings and with the basic Latin alphabet, and you can rest assured it will be almost always improperly implemented anyway.

[0] https://en.wikipedia.org/wiki/Unicode_equivalence


beagle3 1641 days ago

I understand your point, but still disagree with it. As I see it, the real problem is Unicode identifiers, as I demonstrated with "Example" above, and as follows from your demonstrations as well. There are thousands of Unicode characters, which are unlikely to all be familiar to any single person, and whose meaning and "conjugation" (casing, pre-joined pairs, precomposed versions, etc.) differ between cultures.

ASCII case folding, as employed by Nim and Pascal, by contrast refers to 26 specific, well-known characters. It's a non-issue.
