
by zahlman 616 days ago

Python's strings have uppercase, lowercase and case-folding methods that don't choke on this. They don't use UTF-16 internally. (They can use UCS-2 for strings whose code points all fit in that range, and while a string might store code points from the surrogate range, those are never interpreted as surrogate pairs; they serve as an error encoding so that e.g. invalid UTF-8 can be round-tripped.) So they never have to worry about surrogate pairs, and they know a few things about language-specific casing:

    >>> 'ß'.upper()
    'SS'
    >>> 'ß'.lower()
    'ß'
    >>> 'ß'.casefold()
    'ss'

There are a lot of really complicated tasks for Unicode strings. String casing isn't really one of them.

(No, Python can't turn 'SS' back into 'ß'. But doing that requires metadata about language that a string simply doesn't represent.)
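A minimal sketch of two claims in the comment above, assuming CPython 3: undecodable bytes survive a round trip through `str` via the `surrogateescape` error handler, and uppercasing followed by lowercasing does not restore 'ß'. The variable names are illustrative only.

```python
# Invalid UTF-8 round-trips through str via lone surrogates (PEP 383).
raw = b"caf\xe9"  # Latin-1 bytes, not valid UTF-8
text = raw.decode("utf-8", errors="surrogateescape")
restored = text.encode("utf-8", errors="surrogateescape")

# The undecodable byte 0xE9 is stored as the lone surrogate U+DCE9.
escaped_code_point = ord(text[-1])

# Uppercasing is lossy: lowering 'SS' again cannot recover 'ß'.
round_trip = "ß".upper().lower()

print(restored == raw, hex(escaped_code_point), round_trip)
```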

3 comments

crote 615 days ago

But that's wrong. The uppercase for "in Maßen" ("in moderate amounts") is not "IN MASSEN" ("in Massen", meaning "in massive amounts").
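The collision described here can be reproduced with Python's default mapping. This is only an illustration of the ambiguity, not a statement about what correct German orthography requires:

```python
moderate = "in Maßen"  # "in moderate amounts"
massive = "in Massen"  # "in massive amounts"

# Both phrases uppercase to the same string, so the distinction is lost.
collide = moderate.upper() == massive.upper()

print(moderate.upper(), massive.upper(), collide)
```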

kccqzy 616 days ago

Still breaks on, for example, Turkish i vs İ. It's impossible to do correctly without language information.

> (No, Python can't turn 'SS' back into 'ß'. But doing that requires metadata about language that a string simply doesn't represent.)

Yes that's my point. Because in typical languages strings don't store language metadata, this is impossible to do correctly in general.
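For reference, a short sketch of what Python's `str` methods do with the Turkish letters, assuming Python 3.3 or later. They apply Unicode's language-independent mappings, so the Turkish dotted/dotless pairing is not honoured:

```python
upper_i = "i".upper()        # 'I', where Turkish expects 'İ' (U+0130)
lower_cap_i = "I".lower()    # 'i', where Turkish expects 'ı' (U+0131)
lower_dotted = "İ".lower()   # 'i' followed by U+0307 COMBINING DOT ABOVE
upper_dotless = "ı".upper()  # 'I'

print(upper_i, lower_cap_i, [hex(ord(c)) for c in lower_dotted], upper_dotless)
```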

zahlman 616 days ago

I'm not seeing anything in the Swift documentation about strings carrying language metadata, either, though?

kccqzy 616 days ago

This lowercase function takes a locale argument https://developer.apple.com/documentation/foundation/nsstrin...

It looks like an old NSString method that's available in both Obj-C and Swift.

The casefold function is even older than that. https://developer.apple.com/documentation/foundation/nsstrin... Its documentation specifically includes a discussion of the Turkish İ/I issue.

tedunangst 616 days ago

But that's wrong. The upper case for ß is ẞ.
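For comparison, this is how Python's default (language-independent) mappings treat the two characters. It is a sketch of observed behaviour, not a ruling on which uppercase form is correct:

```python
sharp_s = "ß"          # U+00DF LATIN SMALL LETTER SHARP S
capital_sharp_s = "ẞ"  # U+1E9E LATIN CAPITAL LETTER SHARP S

default_upper = sharp_s.upper()             # expands to 'SS', not 'ẞ'
lower_of_capital = capital_sharp_s.lower()  # maps back to 'ß'

# Case folding treats both forms (and 'ss') as equivalent for comparison.
folded_equal = capital_sharp_s.casefold() == sharp_s.casefold() == "ss"

print(default_upper, lower_of_capital, folded_equal)
```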

cm2187 616 days ago

C#'s "ToUpper" takes an optional CultureInfo argument if you want to play around with how to treat different languages. Again, solved problem decades ago.
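Python's `str.upper()` has no such parameter. The idea of passing the language in explicitly can be sketched as follows; `upper_for_language` is a hypothetical helper that handles only the Turkic dotted/dotless i rule, and it is not how C# or Foundation implement their versions:

```python
def upper_for_language(text: str, lang: str = "") -> str:
    """Hypothetical helper: uppercase with one Turkic special case.

    Only the dotted/dotless i rule for Turkish ("tr") and Azerbaijani
    ("az") is handled; every other language falls back to str.upper().
    """
    if lang.split("-")[0].lower() in {"tr", "az"}:
        # Map 'i' to 'İ' (U+0130) before the default mapping runs;
        # 'ı' (U+0131) already uppercases to 'I' by default.
        text = text.replace("i", "\u0130")
    return text.upper()


print(upper_for_language("istanbul"))           # default mapping
print(upper_for_language("istanbul", "tr-TR"))  # Turkish mapping
```

A real application would more likely delegate this to a library such as ICU than maintain per-language rules by hand.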

account42 615 days ago

This is not a locale issue, it's a Unicode version issue. Which highlights another problem with adding this to the base standard library.
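Which Unicode version a given runtime implements can be checked directly. In Python, as a small sketch, the standard `unicodedata` module reports the version of the Unicode Character Database bundled with the interpreter, and whether a given code point is known to it:

```python
import unicodedata

# Version of the Unicode Character Database this interpreter ships with.
ucd_version = unicodedata.unidata_version

# Look up the capital sharp s; the fallback string is returned if the
# bundled database does not know the code point.
capital_sharp_s = "\u1e9e"
name = unicodedata.name(capital_sharp_s, "<unassigned in this UCD version>")

print(ucd_version, name)
```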

IncreasePosts 616 days ago

That was only adopted in Germany like 7 years ago!

kccqzy 616 days ago

Well, languages and conventions change. The € sign was added not that long ago and it was somewhat painful. The Chinese language uses a single character for each chemical element, so when IUPAC names new elements, new characters get invented for them. Etc.

extraduder_ire 615 days ago

Does Unicode have space set aside for those new symbols to slot into? I know it's very rare, but it could get messy.

account42 615 days ago

Unicode is already messy. Chinese characters especially so, due to Han unification.

Towaway69 615 days ago

Isn't uppercase for ß just ß - i.e. it's its own uppercase character?

bratwurst3000 615 days ago

There shouldn't be an uppercase version of ß, because there is no word in the German language that uses it as the first letter. The German language didn't think of allcaps. Please correct me if I am wrong. If written in uppercase, it should be converted to SZ or the new uppercase ß (which my iPhone doesn't have), and converting anything to uppercase SS isn't something Germany wants.

account42 615 days ago

> There shouldn't be an uppercase version of ß, because there is no word in the German language that uses it as the first letter. The German language didn't think of allcaps.

Allcaps (and smallcaps) has always existed in signage everywhere. Before the computing age, letters were just arbitrary metal stamps -- and just whatever you could draw before that. Historically, language was not as standardized as it is today.

Towaway69 615 days ago

I don't think that Germany wants a capital ß or that the German language requires one; rather, technology needs one to dot the i's and cross the t's.

account42 615 days ago

Not generally, no, but some applications used it that way because of the ambiguity of uppercasing ß to SS, which is why ẞ was added.

Towaway69 615 days ago

On the other hand, the German language has existed for several hundred years without having a capital ß but now it needs one?

True, capitalisation has always existed, but even that didn't seem to require a capital ß. Why now?