

by korlja 1481 days ago

.ToUpper() is locale-dependent, so it can only be used if the locale of the text in question is known. E.g. German ß capitalizes to SS, and .ToUpper().ToLower() should give you either 'ss' or 'ß' depending on what it was before. Always outputting 'ss' is okish and readable, but actually wrong.
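
A minimal sketch of the round trip described above, written in Python rather than .NET (an assumption made for illustration; Python's `str.upper()` and `str.lower()` apply the locale-independent Unicode default mappings):

```python
# Round trip through the default Unicode case mappings.
original = "Fuß"
shouted = original.upper()   # ß has no single-character default uppercase, so it expands
restored = shouted.lower()

print(shouted)    # FUSS
print(restored)   # fuss, not fuß: the ß is lost

# Nothing in "FUSS" records whether it came from "Fuß" or "Fuss".
assert "Fuß".upper() == "Fuss".upper()
```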

Blindly calling .ToUpper() on anything is a typical anglo-centric mistake. Just don't use .ToUpper(), shoutcase is ugly anyways ;)

See also: one of the many "100 fallacies programmers assume about natural written language" documents or such.

6 comments

AdrianoKF 1481 days ago

Small nitpick: uppercase ẞ was added to Unicode 5.1 in 2007 (https://unicode-table.com/en/1E9E/) and is considered correct German orthography since 2017 (see §25 E3 in https://grammis.ids-mannheim.de/rechtschreibung/6180#par25E3)

usr1106 1481 days ago

How often do you see the new letter in German everyday life? Although I'm German myself, I don't visit Germany that often these days, but I still read a couple of German publications regularly. I have never seen the new letter outside of discussions by software people about character handling.

ttepasse 1481 days ago

I do sometimes, but I'm rather sensitive to the ẞ issue: my last name contains an ß, and uppercasing would mean either keeping the ß lowercase (the Personalausweis does that †, and it looks ugly) or doing the ß → SS transformation, which is somewhat forbidden in identity documents; a name must be exact. Hence, someday in the future, hopefully, the ẞ. While personal names were a major motivation for the inclusion of the ẞ into Unicode, I’m always happy to see it in the wild in press or book titles or such.

† Although it’s Germany and of course there exists an obscure Verwaltungsvorschrift according to which you can write the non-machine-readable field of the Personalausweis/Pass in lowercase, exactly for this use case. I didn’t know that last time but I fully intend to make some poor civil servant’s life a slight hell the next time I have to renew.

gumby 1481 days ago

I assumed it was added for shop signs and product packaging (i.e. as a gimmick).

Speaking of surviving Fraktur ligatures, I’m sorry that a couple of others like tz didn’t make it to Roman. It makes poor ß appear lonely.

AdrianoKF 1481 days ago

I was actually wondering if the driving factor is legal documents. ID cards show names in all-caps letters, which creates the dilemma that your ID might not show your actual name (notwithstanding international standards for travel documents that prescribe transliteration of non-Latin characters; see ICAO Doc 9303 Part 3, section 6 [0] for examples).

[0]: https://www.icao.int/publications/Documents/9303_p3_cons_en....

gumby 1481 days ago

That’s a good theory, especially as section 3.1 of that ICAO document explicitly permits the use of ß.

Bringing the thread back to the topic of this comment section: the ICAO document also calls the digits 0123456789 “Arabic” even though their shapes are closer to the original Hindi (Devanagari) forms than to actual Arabic digits — another “Hindi/Turkey” situation

korlja 1481 days ago

That is correct and solves the roundtrip-problem (in this case and language). But uppercase 'ẞ' is just an additional option at the discretion of the writer, the recommended variant continues to be 'SS'.
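
A sketch of how opting in to the capital ẞ makes the round trip lossless. This is in Python for illustration; `upper_with_capital_eszett` is a hypothetical helper, not a standard-library function, and the default mapping of ß remains SS:

```python
def upper_with_capital_eszett(text: str) -> str:
    """Uppercase text, mapping ß to ẞ (U+1E9E) instead of the default SS.

    Hypothetical helper, not a standard-library function.
    """
    return text.replace("ß", "\u1e9e").upper()

shouted = upper_with_capital_eszett("Fuß")
print(shouted)           # FUẞ
print(shouted.lower())   # fuß

# The default mapping is unchanged: ß still uppercases to SS.
print("Fuß".upper())     # FUSS
```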

egeozcan 1481 days ago

> German ß capitalizes to SS, and .ToUpper().ToLower() should give you either 'ss' or 'ß' depending on what it was before

As long as there is no unicode SS character, we are into the "what color are your bits" problem or tolower needs to be language and word aware.

In .NET the uppercase and lowercase functions are culture-aware (defaulting to the system settings, which breaks more software than you might think) but not word-aware, AFAIK.

bee_rider 1481 days ago

> As long as there is no unicode SS character, we are into the "what color are your bits" problem or tolower needs to be language and word aware.

It turns out there is such a Unicode character -- ẞ/ß -- although based on other comments here it looks like it was added fairly recently.

Upper/Lower case stuff just seems to be at an annoying intersection where it has cultural and also programming significance. Or at least, people will use toUpper when they really want some case-insensitive sortable version of the string.

(Based on some googling, localeCompare is probably the way to go, in JavaScript at least.)

3836293648 1480 days ago

I hate the locale nonsense. The decimal point is `.` and not `,`. The rest of this stupid country is wrong

Hamuko 1481 days ago

>Blindly calling .ToUpper() on anything is a typical anglo-centric mistake.

Yes, one that you might make if you were, for example, trying to make English text uppercase. Which is why it would be daft for anyone to suggest that their country has two different English spellings depending on the character case.

d1sxeyes 1481 days ago

.toUpper() is a quick and mostly effective way to normalise strings for comparison if you're not sure what case the two strings to compare are in (e.g. one has been input by a user). Yes, it's a shortcut, and occasionally you'll end up with a miss, but it's good enough to work 99% of the time, and the alternative is a LOT of code and data changes to handle a very small proportion of cases.
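
For illustration, here is one such miss and a possible alternative, sketched in Python rather than JavaScript (an assumption). `equals_ignore_case` is a hypothetical helper name; it relies on the standard `str.casefold()` and `unicodedata.normalize()` to approximate Unicode canonical caseless matching:

```python
import unicodedata

def equals_ignore_case(a: str, b: str) -> bool:
    """Canonical caseless match; a hypothetical helper name."""
    def fold(s: str) -> str:
        # Decompose, case-fold, then decompose again so equivalent strings agree.
        return unicodedata.normalize("NFD", unicodedata.normalize("NFD", s).casefold())
    return fold(a) == fold(b)

# Comparing uppercased strings misses this pair: ẞ stays ẞ while ß becomes SS.
print("GROẞ".upper() == "groß".upper())         # False
print(equals_ignore_case("GROẞ", "groß"))       # True
print(equals_ignore_case("Straße", "STRASSE"))  # True
```

Note that case folding treats ß and ss as equal, which may or may not be what a given application wants.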

vesinisa 1481 days ago

Hmm I think you miss the point. In some programming environments (like C# and Java) .toUpper() is always incorrect in code unless you are displaying the resulting string in a UI, as it uses the "current locale", which is whatever the user has selected for the machine. When e.g. comparing strings case-insensitively, you should always explicitly specify the locale where the conversion should happen instead of relying on an external configuration variable.
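
To illustrate passing the locale explicitly instead of reading it from the environment, here is a sketch in Python. Python's built-in `str.upper()` is not locale-aware, so the `to_upper` helper and its `language` parameter are hypothetical, and only the Turkish/Azerbaijani dotted and dotless i rule is handled:

```python
def to_upper(text: str, language: str) -> str:
    """Uppercase text for an explicitly named language (hypothetical helper).

    Only the Turkish/Azerbaijani i rule is special-cased; everything else
    falls through to the default Unicode mapping.
    """
    if language in ("tr", "az"):
        # i -> İ (U+0130) and ı (U+0131) -> I
        text = text.replace("i", "\u0130").replace("\u0131", "I")
    return text.upper()

print(to_upper("title", "en"))  # TITLE
print(to_upper("title", "tr"))  # TİTLE
```

Because the language is an argument, the result does not change when the machine's configured locale does.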

JavaScript actually seems to be the smart one here - its default .toUpperCase() uses the "locale-insensitive case mappings in the Unicode Character Database".

greenshackle2 1481 days ago

> the user has selected for the machine

I don't think most Java and C# software is desktop apps? Surely in most cases it's the locale selected for the server or VM, which should be consistent?

(I'm not saying it's good coding practice, mind you, but it probably ends up accidentally working in a lot of cases.)

vesinisa 1481 days ago

You write like you can know how and where the code will get executed in the future. :) Do you think that the authors of Windows 95 ever imagined the system would one day get ported to an obscure subset of a functional scripting language (Asm.js variety of JavaScript), and get booted in a hyper-text browser running on a PDA device with internet connection (web browser on a smartphone)? Yet - here we are: https://win95.ajf.me/

> I'm not saying it's good coding practice, mind you, but it probably ends up accidentally working in most cases

Fully agree. It's still bad practice and I high-five every linter that automatically flags it.

d1sxeyes 1481 days ago

I did indeed. Thanks - yes, I was referring to JavaScript's .toUpperCase(), silly oversight and assumption on my side.

Thanks for the correction!

underwater 1481 days ago

You make a good case (ha!). What if toUpper() and toLower() were omitted from standard libraries? Usually they are used, incorrectly, to do something like string comparison, which could be better served by a more specific method.

bbu 1481 days ago

Only sz should use ß. Ss stays ss even in German German. Switzerland got rid of the sz/ss distinction a long time ago. So you need to be culture and word aware to do it „right“.

korlja 1481 days ago

'sz' for 'ß' is sometimes used to make things roundtrip-proof in capslock, e.g. on military stencils. HTML calls it 'szlig'. Also, some use "Esszet" as the name of the character. But all are wrong in that ß isn't a ligature of s and z, it is a ligature of s and s. The shape of the character stems from the fact that in Fraktur writing and even some Grotesk fonts, 's' at the end of a word was written 's', while 's' within a word was written 'ſ'. Thus the end of a word like Fuss was written Fuſs, giving a ligature of Fuß. No 'z' anywhere.

kmm 1481 days ago

Originally ß arose as a ligature of s and z, or rather ſ and ʒ. In many older texts, or even current fonts, the second part of the ligature is indisputably a long-tailed ʒ.

https://en.wikipedia.org/wiki/%C3%9F

seszett 1481 days ago

> some use "Esszet" as the name of the character

I believe the actual name is Eszett.

wanderingstan 1481 days ago

Only “wrong” in light of current usage, but not historically.

By this measure, the English name of “W” would be wrong because it’s not actually a “double-U” but a “double-V”. But at the time of the letter’s formation, U and V were not yet separate letters.

https://en.wikipedia.org/wiki/W

jfk13 1481 days ago

The Swedes get this "right", and call it [ˈdɵ̂bːɛlˌveː].

https://en.wikipedia.org/wiki/Swedish_alphabet

kaetemi 1481 days ago

In Dutch it's even more sane, the alphabet just goes V = vee, W = wee.

wanderingstan 1481 days ago

Oh wow, didn’t know that!

samatman 1481 days ago

French as well, although the elegance gained is quickly tarnished by calling y "Greek i".

mzs 1481 days ago

I always thought that German z used to look something between Ꙁ & з. ʒ looks pretty close so ſз became ß but Latin transliteration rules were ss instead. At least that's what I was taught in German class.