by qalmakka 1645 days ago

> Pascal and old basic (among other languages) have been case insensitive for decades

... because back then Unicode was still a pipe dream in the mind of some visionary. Everything used 8 bit encoding, and everyone assumed ASCII or at least something compatible with the 7 bit subset of ASCII.

Nowadays files are encoded in UTF-8, and most modern languages fully support UTF-8 identifiers. Nim itself supports UTF-8 "letters" in identifiers, and what is "upper case" or "lower case" 100% depends on the current locale. Restricting your case normalization logic to ASCII is __really bad__, because it basically means that non-Latin letters in identifiers won't be normalized, with possibly unexpected consequences.

> APL is the only real language that doesn’t impose a western character set.

- Rust uses UTF-8: https://doc.rust-lang.org/reference/identifiers.html

- Go allows any Unicode letter in identifiers: https://go.dev/ref/spec#Identifiers

- Swift is also famous for allowing you to use emojis in identifiers.

- Python supports non-ASCII identifiers: https://www.python.org/dev/peps/pep-3131/

And the list goes on. Even C++ can optionally support Unicode in identifiers (for instance, Clang and GCC do indeed support things like `constexpr auto 黒 { "lol" };`).

beagle3 1645 days ago

> ... because back then Unicode was still a pipe dream in the mind of some visionary. Everything used 8 bit encoding, and everyone assumed ASCII or at least something compatible with the 7 bit subset of ASCII.

They could still have been case sensitive (C was), so I don't understand how that's relevant to the idea that "case insensitivity is a problem".

> Rust, Go, Swift, Python

All of these languages impose ASCII for their keywords and directives. They allow you to use other characters for identifiers, but impose ASCII in everything that has pre-defined semantics. Original APL is the only "real"/practical language that I'm aware of that gave up the "western centric view" of the world to the point that it doesn't have a single English keyword. (Brainfuck, etc. exist as well, but ....)

And they all impose a left-to-right reading order, which is just as western-centric. Arabic/Farsi/Hebrew go right-to-left, and there are languages that can also go top-to-bottom.

I think the outrage about "western centrism" is misguided. This is a formal system, and just like math, it reflects some history by using Latin letters and left-to-right order for the predefined symbols, and even a preference for Latin characters in identifiers.

> Nim itself supports UTF-8 "letters" in identifiers, and what is "upper case" or "lower case" 100% depends on the current locale.

If that's true, that may be a problem. I'll look into that, thanks for pointing it out. From memory, Nim only folds the lower 7 bits, using the difference of 32 between upper and lower case ASCII codes, so it is well defined regardless of locale, but I'll check.

The whole idea of UTF-8 in identifiers is a minefield, whether you fold case or not; e.g.:

"Εхаｍрⅼе" and "Example" have no single letter in common (I chose them that way using [0]) and no language that allows UTF-8 identifiers is going to warn you about that.

[0] https://www.irongeek.com/-attack-generator.php
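To make the homoglyph point concrete, here is a small Python sketch. The escaped code points are my reading of the spoofed string above (an assumption; check them against your own copy), and `describe` is a hypothetical helper name, not part of any library. It lists what each character really is and checks whether the two strings share anything before and after NFKC normalization.

```python
import unicodedata

# Assumed code points of the look-alike string quoted above.
spoof = "\u0395\u0445\u0430\uff4d\u0440\u217c\u0435"
plain = "Example"

def describe(s):
    """Return (code point, Unicode name) for every character in s."""
    return [(f"U+{ord(c):04X}", unicodedata.name(c)) for c in s]

# Characters the two strings have in common as raw code points.
shared = set(spoof) & set(plain)

# NFKC folds compatibility characters (fullwidth m, Roman numeral l)
# to ASCII, but leaves the Greek and Cyrillic look-alikes untouched.
nfkc = unicodedata.normalize("NFKC", spoof)
shared_after_nfkc = set(nfkc) & set(plain)

for code, name in describe(spoof):
    print(code, name)
print("shared before NFKC:", shared)
print("equal after NFKC:", nfkc == plain)
```

Even after NFKC the two strings remain different, so normalization alone does not catch this kind of spoofing.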

qalmakka 1644 days ago

> They could still have been case sensitive (C was), so I don't understand how that's relevant to the idea that "case insensitivity is a problem".

The point here is that case insensitivity is only a viable option if you severely limit the encoding allowed in whatever you are using, be it a programming language, a filesystem, etc. If the encoding of your files is something basically akin to ASCII or ISO-whatever (which was what BASIC and Pascal used back in the day) then case insensitivity is trivial and safe.

This whole thing breaks apart as soon as you enter a Unicode world and start accepting identifiers containing more than ASCII, and then the whole concept of "case insensitive" becomes obsolete and outright wrong.

The Unicode equivalent of "case insensitive" is Normalization [0] and it's one heck of a minefield because it is defined depending on the locale in use. For instance, "FILE.TXT" and "file.txt" are to be considered equivalent under en_US, but not under tr_TR, where the lower case version of "FILE.TXT" is "fıle.txt" and the upper case version of "file.txt" is "FİLE.TXT". This means that normalizing strings can lead to unexpected results depending on the locale, which is especially problematic with filesystems (where a path may or may not exist depending on the locale).
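A rough Python sketch of the Turkish example, for illustration only. Python's `str.lower()` and `str.upper()` ignore the locale, so `turkish_lower` and `turkish_upper` below are hypothetical helpers that hard-code just the dotted/dotless i pairs; a real implementation would use a locale-aware library instead.

```python
DOTLESS_I = "\u0131"      # ı, LATIN SMALL LETTER DOTLESS I
DOTTED_CAP_I = "\u0130"   # İ, LATIN CAPITAL LETTER I WITH DOT ABOVE

def turkish_lower(s: str) -> str:
    # Map the two Turkish-specific pairs first, then use the
    # locale-independent default mapping for everything else.
    return s.replace("I", DOTLESS_I).replace(DOTTED_CAP_I, "i").lower()

def turkish_upper(s: str) -> str:
    return s.replace("i", DOTTED_CAP_I).replace(DOTLESS_I, "I").upper()

default_lower = "FILE.TXT".lower()       # what en_US-style folding gives
tr_lower = turkish_lower("FILE.TXT")     # what tr_TR rules would give
tr_upper = turkish_upper("file.txt")

print(default_lower, tr_lower, tr_upper)
print("same path under both rules?", default_lower == tr_lower)
```

The two lowercase results differ in a single character, which is exactly how a path can "exist" under one locale and not under another.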

> Nim only folds the lower 7-bit by a 32 difference in ascii code, so it is well defined regardless of locale

Yes, it is well defined, but allowing the entirety of the Unicode letters also means that identifiers may contain glyphs from alphabets that have separate cases, chiefly Greek and Russian, or even accented letters such as `è` or `ö`. Case insensitivity instead of proper normalization makes them potentially confusing, and rather defeats the intent behind allowing Unicode identifiers by making non-US locales second class citizens.

IMHO it is arguably very confusing to non-English speakers that 'mela' is equivalent to 'MELA' but 'tè' isn't equivalent to 'TÈ' while 'Tè' is. It basically means you have to remember what letters are ASCII and what are not, which makes the whole "case insensitive" a potential source of confusion.
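Here is a minimal Python sketch of ASCII-only folding as described in this thread (shift `A`-`Z` by 32, leave everything else alone). `ascii_fold` is a hypothetical name and this is not Nim's actual identifier comparison, only an illustration of the 'mela' / 'tè' asymmetry.

```python
def ascii_fold(s: str) -> str:
    """Lowercase only the 26 ASCII capitals; leave all other characters as-is."""
    return "".join(chr(ord(c) + 32) if "A" <= c <= "Z" else c for c in s)

E_GRAVE_LOWER = "\u00e8"  # è
E_GRAVE_UPPER = "\u00c8"  # È

# Pure ASCII: folding makes the two spellings equal.
mela_equal = ascii_fold("MELA") == ascii_fold("mela")

# Non-ASCII capital: È is not touched, so the spellings stay different.
te_upper_equal = ascii_fold("T" + E_GRAVE_UPPER) == ascii_fold("t" + E_GRAVE_LOWER)

# Only the ASCII letter differs in case: folding makes them equal again.
te_mixed_equal = ascii_fold("T" + E_GRAVE_LOWER) == ascii_fold("t" + E_GRAVE_LOWER)

print(mela_equal, te_upper_equal, te_mixed_equal)
```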

I think it is safe to say that in 2021 case insensitivity is an obsolete concept and an obstacle to proper internationalization. Case insensitivity only really works on legacy encodings and with the basic Latin alphabet, and you can rest assured it will be almost always improperly implemented anyway.

[0] https://en.wikipedia.org/wiki/Unicode_equivalence

beagle3 1642 days ago

I understand your point, but still disagree with it. As I see it, the real problem is Unicode identifiers, as I demonstrated with "Example" above, and as follows from your demonstrations as well. There are thousands of Unicode characters, which are unlikely to all be familiar to any single person, and whose meaning and variations (casing, conjugation, pre-joined pairs, precomposed versions, etc.) differ between cultures.

The ASCII case folding employed by Nim and Pascal, by contrast, refers to 26 specific, well-known characters. It's a non-issue.
