by tn13 1674 days ago
The English-speaking world has developed an intuition about strings based on ASCII, which simply fails when it comes to Unicode, and that explains a lot of these pitfalls.

String length under definition #2 is also fairly complex for some languages such as Hindi. There are some symbols in Hindi which are not characters in their own right and can never stand alone, but when placed next to a character they create a new character. So when you type them on a keyboard you have to hit two keys, but only one character appears on screen. Unicode too represents this as two separate characters, but to the human eye it is one.

त + या = त्या

The following code will print 4:

console.log("त्या".length);
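For anyone wondering where the 4 comes from, here is a small sketch that lists the code points in that string. The string is written with escapes so its exact contents are unambiguous, and the variable names are only for illustration.

```javascript
// "त्या" spelled out with escapes so the exact code points are visible:
// TA (U+0924), VIRAMA (U+094D), YA (U+092F), VOWEL SIGN AA (U+093E).
const word = "\u0924\u094D\u092F\u093E";

// Array.from iterates by code point, not by UTF-16 code unit.
const codePoints = Array.from(word, (ch) =>
  "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")
);

console.log(word.length); // 4
console.log(codePoints.join(" ")); // U+0924 U+094D U+092F U+093E
```

All four code points are in the Basic Multilingual Plane, so each one takes a single UTF-16 code unit, which is what `.length` counts.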

4 comments

"symbols in Hindi which are not characters and can never exist as their own character but when placed next to a character they create a new character"

a.k.a. 'ligatures', as in f+f+i -> U+fb03 'ffi'

I would consider ligatures a text rendering concept, which allows for but is distinct from the linguistic concept described by GP.

Edit: to further illustrate my point, in the ligatures I'm familiar with (including the ones in your link), the component characters exist standalone and can be used on their own, unlike GP's example.
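As a rough illustration of the "rendering concept" side: Unicode encodes the ffi ligature as a compatibility character, so NFKC normalization expands it back into its standalone letters. A minimal sketch, with variable names chosen just for this example:

```javascript
// U+FB03 LATIN SMALL LIGATURE FFI is a single code point.
const ligature = "\uFB03";

// NFKC applies compatibility decomposition, which expands the ligature
// into the three ordinary letters it stands for.
const expanded = ligature.normalize("NFKC");

console.log(ligature.length); // 1
console.log(expanded); // ffi
console.log(expanded.length); // 3
```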

In the example "Straße", the ß is, in fact, derived from an ancient ligature for sz. Old German fonts often had s as ſ, and z as ʒ. This ſʒ eventually became ß.

We (completely?) lost ſ and ʒ over the years, but ß was here to stay. Its usage changed heavily over time (replacing ss instead of sz), I think for the last time in the 90s (https://en.wikipedia.org/wiki/German_orthography_reform_of_1...), where we changed when to use ß and when ss.

So while we do replace ß with ss if we uppercase or have no ß available on the keyboard, no one would ever replace ß by sz (or even ſʒ) today, unless for artistic or traditional reasons.

Many people uppercase ß with lowercase ß or, for various reasons, an uppercase B. I have yet to see a real-world example of an uppercase ẞ; it does not seem to exist outside of the internet. For example, "Straße" could be seen capitalized in the wild as STRAßE, STRASSE, or STRABE; with Unicode it could also be STRAẞE. It would not be capitalized with sz (STRASZE) or even ſʒ (STRAſƷE – there is no uppercase ſ) – at least not in Germany. In Austria, sz seems to be an option.

So, for most ligatures I would agree with you, but specifically ß is one of those ligatures I would call an outlier, at least in Germany.

P.S.: Maybe the ampersand (&), which is derived from ligatures of the Latin "et", sometimes has similar problems, although on a different level, since it replaces a whole word. However, I have seen it being used as part of "etc.", as in "&c." (https://en.wiktionary.org/wiki/%26c.), so your point might also hold.

P.P.S.: I wonder why the uppercasing in the original post did not use ẞ, but I guess it is because of the rules in https://www.unicode.org/Public/UCD/latest/ucd/SpecialCasing.... (link taken from the feed). The Wikipedia entry says we adopted the capital ẞ in 2017 (but it has been part of Unicode since 2008). It also states that the replacement SZ should be used if the meaning would otherwise get lost (e.g. "in Maßen" vs. "in Massen" would both be "IN MASSEN" but mean either "in moderate amounts" or "in masses", forcing the first to be capitalized as MASZEN). I doubt any programming language or library handles this. I would not even have handled it myself in a manual setting, as it is such an extreme edge case. And when I read it, I would stumble over it.
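To make the "IN MASSEN" collision concrete, here is a small sketch using JavaScript's default uppercasing, which maps ß to SS. The variable names are just for illustration.

```javascript
// Default (locale-independent) uppercasing maps ß (U+00DF) to "SS".
const moderate = "in Ma\u00DFen".toUpperCase(); // "in Maßen"
const masses = "in Massen".toUpperCase();

console.log(moderate); // IN MASSEN
console.log(moderate === masses); // true: the distinction is lost

// The capital ẞ (U+1E9E) lowercases to ß, but ß does not uppercase to it.
console.log("\u1E9E".toLowerCase() === "\u00DF"); // true
console.log("\u00DF".toUpperCase()); // SS
```

Producing MASZEN would need knowledge of the word's meaning, which a plain case-mapping function does not have.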

Swift handles this really well:

"त्या".count // 1

"त्या".unicodeScalars.count // 4

"त्या".utf8.count // 12

JavaScript's minimal library is of course not great, but there are libraries which can help, e.g. grapheme-splitter, although it's not language-aware by design, so in this instance it'll return 2.

graphemeSplitter.countGraphemes("त्या") // 2
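For comparison, here is a sketch of the same counts without a third-party library, roughly mirroring the Swift example. `countAll` is a hypothetical helper name. The grapheme count relies on `Intl.Segmenter`, which is not available in every runtime and whose result for this string depends on the Unicode data the runtime ships with, so no expected value is given for it.

```javascript
// Count one string at several levels of abstraction.
function countAll(s) {
  const counts = {
    utf16CodeUnits: s.length, // what String.prototype.length reports
    codePoints: Array.from(s).length, // iteration is by code point
    utf8Bytes: new TextEncoder().encode(s).length, // encoded size in UTF-8
  };
  // Intl.Segmenter is not available everywhere, so guard for it.
  if (typeof Intl !== "undefined" && typeof Intl.Segmenter === "function") {
    const segmenter = new Intl.Segmenter("hi", { granularity: "grapheme" });
    counts.graphemes = Array.from(segmenter.segment(s)).length;
  }
  return counts;
}

const result = countAll("\u0924\u094D\u092F\u093E"); // "त्या"
console.log(result.utf16CodeUnits); // 4
console.log(result.codePoints); // 4
console.log(result.utf8Bytes); // 12
console.log(result.graphemes); // varies with the runtime's Unicode version
```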

We even already had something like this in pure ASCII: "a\bc" has "length" 3 but appears as one glyph when printed (assuming your terminal interprets backspace).
This made me think of Hangul, when not using the precomposed block. What's the string length of 한글?
In the Rakudo compiler for Raku, which I just tried, the "chars" count using the default EGC (extended grapheme cluster) counting is 2.
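On the precomposed vs. decomposed question, a quick sketch in JavaScript, where `.length` counts UTF-16 code units: the precomposed form has two, while the NFD form breaks each syllable into three conjoining jamo. Variable names are just for illustration.

```javascript
// "한글" as two precomposed Hangul syllables (U+D55C, U+AE00).
const precomposed = "\uD55C\uAE00";

// NFD decomposes each syllable into leading consonant, vowel, and
// trailing consonant jamo.
const decomposed = precomposed.normalize("NFD");

console.log(precomposed.length); // 2
console.log(decomposed.length); // 6
console.log(decomposed.normalize("NFC") === precomposed); // true
```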