by korlja 1434 days ago
.ToUpper() is locale-dependent, so can only be used if the locale of the text in question is known. E.g. German ß capitalizes to SS, and .ToUpper().ToLower() should give you either 'ss' or 'ß' depending on what it was before. Always outputting 'ss' is okish and readable, but actually wrong.
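A minimal sketch of that round trip, in Python rather than .NET (an assumption for illustration; Python's str.upper() and str.lower() use the Unicode default mappings and ignore the locale). The example words Maße and Masse are chosen for this sketch and are not from the comment:

```python
# Two different German words: "Maße" (measurements) and "Masse" (mass).
words = ["Maße", "Masse"]

# Uppercasing maps ß to SS, so both words collapse to the same string.
shouted = [w.upper() for w in words]

# Lowercasing cannot know which "SS" used to be an ß.
round_tripped = [s.lower() for s in shouted]

print(shouted)        # ['MASSE', 'MASSE']
print(round_tripped)  # ['masse', 'masse']
```

The original "Maße" is not recoverable from "MASSE" without knowing the word.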

Blindly calling .ToUpper() on anything is a typical anglo-centric mistake. Just don't use .ToUpper(), shoutcase is ugly anyways ;)

See also: one of the many "100 fallacies programmers assume about natural written language" documents or such.

6 comments

Small nitpick: uppercase ẞ was added in Unicode 5.1 in 2008 (https://unicode-table.com/en/1E9E/) and has been considered correct German orthography since 2017 (see §25 E3 in https://grammis.ids-mannheim.de/rechtschreibung/6180#par25E3)
How often do you see the new letter in German everyday life? I'm German myself, and although I don't visit Germany that often these days, I still read a couple of German publications regularly. I have never seen the new letter outside of discussions by software people about character handling.
I do sometimes, but I'm rather sensitive to the ẞ issue: my last name contains an ß, and uppercasing would mean either keeping the ß lowercase, which the Personalausweis does (†) and which looks ugly, or doing the ß → SS transformation, which is somewhat forbidden in identity documents; a name must be exact. Hence, someday in the future, hopefully, the ẞ. While personal names were a major motivation for the inclusion of the ẞ into Unicode, I’m always happy to see it in the wild in press or book titles or such.

† Although it’s Germany and of course there exists an obscure Verwaltungsvorschrift according to which you can write the non-machine-readable field of the Personalausweis/Pass in lowercase, exactly for this use case. I didn’t know that last time, but I fully intend to make some poor civil servant's life a slight hell the next time I have to renew.

I assumed it was added for shop signs and product packaging (i.e. as a gimmick).

Speaking of surviving Fraktur ligatures, I’m sorry that a couple of others like tz didn’t make it to Roman. It makes poor ß appear lonely.

I was actually wondering if the driving factor is legal documents. ID cards show names in all-caps letters, which creates the dilemma that your ID might not show your actual name (notwithstanding international standards for travel documents that prescribe transliteration of non-Latin characters; see ICAO Doc 9303 Part 3, section 6 [0] for examples).

[0]: https://www.icao.int/publications/Documents/9303_p3_cons_en....

That’s a good theory, especially as section 3.1 of that ICAO document explicitly permits the use of ß.

Bringing the thread back to the topic of this comment section: the ICAO document also calls the digits 0123456789 “Arabic” even though their shapes are closer to the original Hindi (Devanagari) forms than to actual Arabic digits — another “Hindi/Turkey” situation

That is correct and solves the roundtrip-problem (in this case and language). But uppercase 'ẞ' is just an additional option at the discretion of the writer, the recommended variant continues to be 'SS'.
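To illustrate with Python's built-in string methods (chosen only as an example; they follow the Unicode default case mappings): the capital ẞ lowercases to ß, but ß still uppercases to SS, so the round trip only works if the writer, or the code, picks ẞ explicitly. The helper upper_keeping_eszett is hypothetical, not a standard function:

```python
CAPITAL_ESZETT = "\u1E9E"  # ẞ
SMALL_ESZETT = "\u00DF"    # ß


def upper_keeping_eszett(text: str) -> str:
    """Hypothetical helper: uppercase, but map ß to ẞ instead of SS."""
    return text.replace(SMALL_ESZETT, CAPITAL_ESZETT).upper()


name = "Weiß"

default_upper = name.upper()                 # default mapping: ß becomes SS
explicit_upper = upper_keeping_eszett(name)  # writer's choice: ß becomes ẞ

print(default_upper, default_upper.lower())    # WEISS weiss
print(explicit_upper, explicit_upper.lower())  # WEIẞ weiß
```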
> German ß capitalizes to SS, and .ToUpper().ToLower() should give you either 'ss' or 'ß' depending on what it was before

As long as there is no unicode SS character, we are into the "what color are your bits" problem or tolower needs to be language and word aware.

In .NET the uppercase and lowercase functions are culture-aware (defaulting to the system settings, which breaks more software than you might think) but not word-aware, AFAIK.

> As long as there is no unicode SS character, we are into the "what color are your bits" problem or tolower needs to be language and word aware.

It turns out there is such a Unicode character -- ẞ/ß -- although based on other comments here it looks like it was added fairly recently.

Upper/Lower case stuff just seems to be at an annoying intersection where it has cultural and also programming significance. Or at least, people will use toUpper when they really want some case-insensitive sortable version of the string.

(Based on some googling, localeCompare is probably the way to go, in JavaScript at least.)
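For comparison, a rough Python sketch of a "case-insensitive sortable version" (Python is used here only for illustration, and the word list is made up). Note the assumption: str.casefold() gives a caseless sort key, which is not the same thing as the locale-aware collation that localeCompare performs:

```python
words = ["straße", "Zebra", "apple", "STRASSE", "Apple"]

# Plain sort compares code points, so every uppercase letter sorts first.
by_code_point = sorted(words)

# Sorting on a case-folded key ignores case; ties keep their input order.
by_casefold = sorted(words, key=str.casefold)

print(by_code_point)  # ['Apple', 'STRASSE', 'Zebra', 'apple', 'straße']
print(by_casefold)    # ['apple', 'Apple', 'straße', 'STRASSE', 'Zebra']
```

The original strings are left untouched; only the key used for ordering is folded.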

I hate the locale nonsense. The decimal point is `.` and not `,`. The rest of this stupid country is wrong
> Blindly calling .ToUpper() on anything is a typical anglo-centric mistake.

Yes, one that you might make if you were, for example, trying to make English text uppercase. Which is why it would be daft for anyone to suggest that their country has two different English spellings depending on the character case.

.toUpper() is a quick and mostly effective way to normalise strings for comparison if you're not sure what case the two strings are in (e.g. one has been input by a user). Yes, it's a shortcut, and occasionally you'll end up with a miss, but it's good enough to work 99% of the time, and the alternative is a LOT of code and data changes to handle a very small proportion of cases.
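A small sketch of where the shortcut can miss, in Python for illustration (the function names are made up for this example). Uppercasing matches Straße against STRASSE, but not against the spelling with a capital ẞ; str.casefold() is the built-in meant for caseless matching and handles both:

```python
def same_via_upper(a: str, b: str) -> bool:
    """Compare by uppercasing both sides (the shortcut)."""
    return a.upper() == b.upper()


def same_via_casefold(a: str, b: str) -> bool:
    """Compare using Unicode case folding."""
    return a.casefold() == b.casefold()


print(same_via_upper("Straße", "STRASSE"))     # True
print(same_via_upper("Straße", "STRAẞE"))      # False: ẞ stays ẞ, ß becomes SS
print(same_via_casefold("Straße", "STRASSE"))  # True
print(same_via_casefold("Straße", "STRAẞE"))   # True: both fold to "strasse"
```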
Hmm, I think you miss the point. In some programming environments (like C# and Java) .toUpper() is always incorrect in code unless you are displaying the resulting string in a UI, as it uses the "current locale", which is whatever the user has selected for the machine. When e.g. comparing strings case-insensitively, you should always explicitly specify the locale in which the conversion should happen instead of relying on an external configuration variable.

JavaScript actually seems to be the smart one here - its default .toUpperCase() uses the "locale-insensitive case mappings in the Unicode Character Database".

> the user has selected for the machine

I don't think most Java and C# software is desktop apps? Surely in most cases it's the locale selected for the server or VM, which should be consistent?

(I'm not saying it's good coding practice, mind you, but it probably ends up accidentally working in a lot of cases.)

You write like you can know how and where the code will get executed in the future. :) Do you think that the authors of Windows 95 ever imagined the system would one day get ported to an obscure subset of a functional scripting language (Asm.js variety of JavaScript), and get booted in a hyper-text browser running on a PDA device with internet connection (web browser on a smartphone)? Yet - here we are: https://win95.ajf.me/

> I'm not saying it's good coding practice, mind you, but it probably ends up accidentally working in most cases

Fully agree. It's still bad practice and I high-five every linter that automatically flags it.

I did indeed. Thanks - yes, I was referring to JavaScript's .toUpperCase(), silly oversight and assumption on my side.

Thanks for the correction!

You make a good case (ha!). What if toUpper() and toLower() were omitted from standard libraries? Usually they are used, incorrectly, to do something like string comparison, which could be better served by a more specific method.
Only sz should use ß. Ss stays ss even in German-german. Switzerland got rid of the sz/ss distinction a long time ago. So you need to be culture and word aware to do it „right“.
'sz' for 'ß' is sometimes used to make things roundtrip-proof in capslock, e.g. on military stencils. HTML calls it 'szlig'. Also, some use "Esszet" as the name of the character. But all are wrong in that ß isn't a ligature of s and z, it is a ligature of s and s. The shape of the character stems from the fact that in Fraktur writing and even some Grotesk fonts, 's' at the end of a word was written 's', while 's' within a word was written 'ſ'. Thus the end of a word like Fuss was written Fuſs, giving a ligature of Fuß. No 'z' anywhere.
Originally ß arose as a ligature of s and z, or rather ſ and ʒ. In many older texts, or even current fonts, the second part of the ligature is indisputably a long-tailed ʒ

https://en.wikipedia.org/wiki/%C3%9F

> some use "Esszet" as the name of the character

I believe the actual name is Eszett.

Only “wrong” in light of current usage, but not historically.

By this measure, the English name of “W” would be wrong because it’s not actually a “double-U” but a “double-V”. But at the time of the letter’s formation, U and V were not yet separate letters.

https://en.wikipedia.org/wiki/W

The Swedes get this "right", and call it [ˈdɵ̂bːɛlˌveː].

https://en.wikipedia.org/wiki/Swedish_alphabet

In Dutch it's even more sane, the alphabet just goes V = vee, W = wee.
Oh wow, didn’t know that!
French as well, although the elegance gained is quickly tarnished by calling y "Greek i".
I always thought that German z used to look something between Ꙁ & з. ʒ looks pretty close so ſз became ß but Latin transliteration rules were ss instead. At least that's what I was taught in German class.