| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by darkengine 3389 days ago
	That is why I said "with moderate success". It's not 100% reliable, but mostly good enough for basic cases. Twitter, for example, used to do an NFC normalization and count codepoints to enforce the 140 charcater limit, and this was close enough to or exactly right for enforcing 140 graphemes in probably 98% of text on Twitter. They can no longer do this because some new emoji can consume 7 codepoints for a single grapheme.

2 comments

Manishearth 3389 days ago

No, it's not, it's "mostly good enough for some scripts". This attitude just ends up making software that sucks for a subset of scripts.

Twitter is not a "basic case", it's a case where the length limit is arbitrary and the specifics don't matter much anymore. Usually when you want to segment text that is not the case.

Edit: basically, my problem with your original comment is that it helps spread the misinformation that a lot of these things are "just" emoji problems, so some folks tend to ignore them since they don't care about emoji that much, and real human languages suffer.

link

mikeash 3389 days ago

So, it was always wrong, but they didn't notice until the "exceptional" cases became common enough in their own use.

This, to me, is an argument for emoji being a good thing for string-handling code. The fact that they're common means that software creators are much more likely to notice when they've written bad string-handling code.

Whatever platform you're using should have an API call for counting grapheme clusters. It may be more complex behind the scenes, but as an ordinary programmer it should be no more difficult to do it correctly than it is to do it wrong.

link

Manishearth 3389 days ago

Ironically, this cuts both ways. Existing folks who didn't care about this now care about it due to emoji. Yay. But this leads to a secondary effect where the idea that this is "just" an emoji issue is spread around, and people who would have cared about it if they knew it affected languages as well may decide to ignore it because it's "just" emoji. It pays to be clear in these situations. I'm happy that emoji exist so that programmers are finally getting bonked on the head for not doing unicode right. At the same time, I'm wary of the situation getting misinterpreted and just shifting into another local extremum of wrongness.

Another contributor to this is that programmers love to hate Unicode. It has its flaws, but in general folks attribute any trouble they have to "oh that's just unicode being broken" even though these are often fundamental issues with international text. This leads to folks ignoring things because it's "unicode's fault", even though it's not.

link