Hacker News
by akira2501 2865 days ago
> So...what's special about Ohms, Kelvins, and ångström

Nothing other than misguided thinking in the early versions of the standard.

The other problem with these special symbols is that if you call tolower() or similar on them, you get back the "normal" character they're based on, and uppercasing that gives you the normal capital rather than the special symbol. So toupper(tolower(char)) != char.
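To make that concrete, here's a quick Python sketch. It uses Python's built-in str.lower() and str.upper() in place of the C functions, which is my substitution, and round_trip is just a name I made up for the illustration.

```python
# The three compatibility symbols from the question.
symbols = {
    "OHM SIGN": "\u2126",
    "KELVIN SIGN": "\u212A",
    "ANGSTROM SIGN": "\u212B",
}

def round_trip(ch):
    """Lowercase, then uppercase again (hypothetical helper)."""
    return ch.lower().upper()

for name, ch in symbols.items():
    back = round_trip(ch)
    # Each symbol comes back as the ordinary letter, not the symbol itself.
    print(f"{name}: U+{ord(ch):04X} -> U+{ord(back):04X}, same={back == ch}")
```

All three come back as the plain Greek or Latin capital, so the comparison is False each time.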


Does tolower() or toupper() even make sense with general unicode characters? I wouldn't expect it to... but I've never really thought about it before :-)
Mostly, we're used to defining tolower() and toupper() to return the lower or upper case variant if one exists, and otherwise to give back what you put in. For most Unicode code points no such variant exists, so you just get back whatever you fed in. Some alphabets distinguish uppercase and lowercase, but most writing systems don't.

However, lower(upper(X)) is not defined to be the same as lower(X), and there's no promise that transforming a string with lower() or upper() does what you hoped, because that isn't how language actually works (e.g. in English the case sometimes marks proper nouns, so "May" is the Prime Minister of the UK, but "may" is just an auxiliary verb).
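A few characters where this shows up, sketched in Python. The str methods stand in for lower() and upper() here, and lower_after_upper is a hypothetical helper name.

```python
def lower_after_upper(s):
    """Apply upper() and then lower() (hypothetical helper)."""
    return s.upper().lower()

# Sharp s, dotless i, and long s: all are already lowercase,
# so lower() alone leaves them untouched.
samples = ["\u00DF", "\u0131", "\u017F"]

for ch in samples:
    direct = ch.lower()
    via_upper = lower_after_upper(ch)
    print(repr(ch), repr(direct), repr(via_upper), direct == via_upper)
```

Going through upper() first replaces each one with a plain ASCII letter (or two, for the sharp s), so the two results differ.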

Where standards tell you something is case-insensitive, but it's also allowed to be Unicode rather than ASCII, you can and probably should "case crush" it with tolower() and then never worry about this problem. In a few places you have to be careful because a standard says something in particular is case-insensitive, but not everything that goes in that slot is case-insensitive. For example MIME content type names like "text/plain", "TEXT/PLAIN" and "Text/Plain" are case-insensitive, but

  multipart/mixed; boundary="ABCDEFGHIJKL"
  multipart/mixed; boundary="abcdefghijkl"
  multipart/mixed; boundary="AbcDefGhiJkl"

... declare three different boundary tokens, and none of them matches the sequence abCdeFghIjkL.
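One way to stay out of trouble is to lowercase only the part that is actually case-insensitive. A minimal sketch using Python's standard email package (my choice of tool, and type_and_boundary is a made-up helper name):

```python
from email.message import Message

def type_and_boundary(header_value):
    """Return (content type, boundary) for a Content-Type header value.

    Hypothetical helper: the type comes back lowercased, while the
    boundary parameter value is returned with its case preserved.
    """
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_type(), msg.get_param("boundary")

for value in (
    'multipart/mixed; boundary="ABCDEFGHIJKL"',
    'MULTIPART/MIXED; boundary="abcdefghijkl"',
    'Multipart/Mixed; boundary="AbcDefGhiJkl"',
):
    print(type_and_boundary(value))
```

The type compares equal in all three cases, and the three boundaries stay distinct.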

What's worse, tolower() and toupper() are locale-dependent. In most locales,

  tolower("I") = "i"
but in Turkish,

  tolower("I") = "ı"
The same happens in the other direction, because Turkish also has a capital I with a dot (İ), which is the uppercase of "i".
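For comparison, Python's str.lower() and str.upper() ignore the locale entirely, so if you want the Turkish behaviour there you have to ask for it yourself. A rough sketch, where turkish_lower and turkish_upper are hypothetical helpers that only special-case the two i pairs and leave everything else to the default mappings:

```python
DOTTED_CAPITAL_I = "\u0130"  # İ
DOTLESS_SMALL_I = "\u0131"   # ı

def turkish_lower(s):
    """Lowercase with Turkish i rules (hypothetical helper)."""
    # Handle the two capital I forms first, then lowercase the rest.
    return s.replace(DOTTED_CAPITAL_I, "i").replace("I", DOTLESS_SMALL_I).lower()

def turkish_upper(s):
    """Uppercase with Turkish i rules (hypothetical helper)."""
    # Handle the two small i forms first, then uppercase the rest.
    return s.replace("i", DOTTED_CAPITAL_I).replace(DOTLESS_SMALL_I, "I").upper()

print("ISTANBUL".lower())         # default, locale-independent mapping
print(turkish_lower("ISTANBUL"))  # Turkish-style mapping
print(turkish_upper("istanbul"))
```

The default mapping also turns İ into an "i" followed by a combining dot (two code points), which is another surprise if you assumed lowercasing keeps the length the same.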