by darkengine 3386 days ago
The thing that frustrates me the most about Unicode emoji is the astounding number of combining characters. For combining characters in written languages, you can do an NFC normalization and, with moderate success, get a 1 codepoint = 1 grapheme mapping, but "Emoji 2.0" introduced some ridiculous emoji compositions with the ZWJ character.

To use the author's example:

‍woman - 1 codepoint

black woman - 2 codepoints, woman + dark Fitzpatrick modifier

‍️‍‍woman kissing woman - 7 codepoints, woman + ZWJ + heart + ZWJ + kissy lips + ZWJ + woman

It's like composing Mayan pictographs, except you have to include an invisible character in between each component.
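For anyone who wants to check the counts above, here is a minimal Python sketch. The specific code point values (U+1F469 for woman, U+1F3FF for the darkest Fitzpatrick modifier, U+2764 for the heart, U+1F48B for the kiss mark) are my reading of the Unicode charts, not something stated in the comment. The sequence is built exactly as listed above; as far as I know the fully-qualified form also carries a variation selector (U+FE0F) after the heart, which would make it 8 code points.

```python
# Code points for the three examples; the hex values are assumptions
# taken from the Unicode charts, not from the comment itself.
WOMAN = "\U0001F469"
DARK_SKIN_TONE = "\U0001F3FF"  # EMOJI MODIFIER FITZPATRICK TYPE-6
ZWJ = "\u200D"                 # ZERO WIDTH JOINER
HEART = "\u2764"               # HEAVY BLACK HEART
KISS_MARK = "\U0001F48B"

woman = WOMAN
black_woman = WOMAN + DARK_SKIN_TONE
# woman + ZWJ + heart + ZWJ + kiss mark + ZWJ + woman
kiss = ZWJ.join([WOMAN, HEART, KISS_MARK, WOMAN])

# In Python 3, len() on a str counts code points, not graphemes.
for label, s in [("woman", woman), ("black woman", black_woman), ("kiss", kiss)]:
    print(label, len(s), [f"U+{ord(c):04X}" for c in s])
```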

Here's another fun one: country flags. Unicode has special characters 🇱 🇮 🇰 🇪 🇹 🇭 🇮 🇸 that you can combine into country codes to create a flag. 🇰+🇷 = 🇰🇷
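A small sketch of how the flag composition works, assuming the regional indicator symbols run from U+1F1E6 (for A) to U+1F1FF (for Z). The helper name `flag` is made up for illustration; whether the result renders as a flag depends on the font and platform.

```python
def flag(country_code: str) -> str:
    """Map a two-letter ASCII country code to two regional indicator symbols."""
    code = country_code.upper()
    if len(code) != 2 or not all("A" <= c <= "Z" for c in code):
        raise ValueError("expected a two-letter ASCII country code")
    # Each letter is offset from 'A' into the regional indicator range.
    return "".join(chr(0x1F1E6 + ord(c) - ord("A")) for c in code)


kr = flag("KR")
print([f"U+{ord(c):04X}" for c in kr])  # two code points, one flag when rendered
```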

edit: looks like HN strips emoji? Changed the emoji in the example into English words. They are all supposed to render as a single "character".

8 comments

> The thing that frustrates me the most about Unicode emoji is the astounding number of combining characters. For combining characters in written languages, you can do an NFC normalization and, with moderate success, get a 1 codepoint = 1 grapheme mapping, but "Emoji 2.0" introduced some ridiculous emoji compositions with the ZWJ character.

No, no, no, NO.

Please stop spreading this bit of misinformation. Emoji has not changed this situation.

Hangul is one where NFC won't work well. Yes, we actually already encode all possible modern hangul syllable blocks in NFC form as well, but this ignores characters with double choseongs or double jungseongs that can be found in older text. Which you sometimes see in modern text, actually.

All Indic scripts (well, all scripts derived from Brahmi, so this includes many scripts from Southeast Asia as well like Thai) would have trouble doing the NFC thing.
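A quick illustration of the difference, using Python's `unicodedata` module. The Devanagari sample (KA followed by VOWEL SIGN I, read as the single syllable "ki") is my own choice of example, not one from the thread.

```python
import unicodedata

latin = "e\u0301"            # e + COMBINING ACUTE ACCENT
devanagari = "\u0915\u093F"  # DEVANAGARI LETTER KA + VOWEL SIGN I

nfc_latin = unicodedata.normalize("NFC", latin)
nfc_devanagari = unicodedata.normalize("NFC", devanagari)

# Latin has a precomposed form, so NFC collapses it to one code point.
print(len(latin), "->", len(nfc_latin))
# Devanagari has no precomposed syllables, so NFC leaves both code points.
print(len(devanagari), "->", len(nfc_devanagari))
```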

I am annoyed that the Unicode spec introduced more complexity into their algorithms to support emoji, but this is because they could have achieved mostly the same task by not introducing emoji-specific complexity and reusing features that existing scripts already have and have already been accounted for.

That is why I said "with moderate success". It's not 100% reliable, but mostly good enough for basic cases. Twitter, for example, used to do an NFC normalization and count codepoints to enforce the 140 character limit, and this was close enough to or exactly right for enforcing 140 graphemes in probably 98% of text on Twitter. They can no longer do this because some new emoji can consume 7 codepoints for a single grapheme.

No, it's not, it's "mostly good enough for some scripts". This attitude just ends up making software that sucks for a subset of scripts.

Twitter is not a "basic case", it's a case where the length limit is arbitrary and the specifics don't matter much anymore. Usually when you want to segment text that is not the case.

Edit: basically, my problem with your original comment is that it helps spread the misinformation that a lot of these things are "just" emoji problems, so some folks tend to ignore them since they don't care about emoji that much, and real human languages suffer.

So, it was always wrong, but they didn't notice until the "exceptional" cases became common enough in their own use.

This, to me, is an argument for emoji being a good thing for string-handling code. The fact that they're common means that software creators are much more likely to notice when they've written bad string-handling code.

Whatever platform you're using should have an API call for counting grapheme clusters. It may be more complex behind the scenes, but as an ordinary programmer it should be no more difficult to do it correctly than it is to do it wrong.
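Python's standard library has no such call that I know of; the third-party `regex` module supports `\X` for grapheme clusters. To show roughly what such an API has to account for, here is a deliberately simplified stdlib-only approximation. It is not a UAX #29 implementation: `rough_grapheme_count` is a made-up name, and it only handles combining marks, ZWJ sequences, skin-tone modifiers, and regional indicator pairs. It will miscount conjoining Hangul jamo, for instance.

```python
import unicodedata

ZWJ = "\u200d"


def is_regional_indicator(ch: str) -> bool:
    return 0x1F1E6 <= ord(ch) <= 0x1F1FF


def is_extender(ch: str) -> bool:
    # Combining marks (which include variation selectors), ZWJ itself,
    # and the emoji skin-tone modifiers attach to the preceding character.
    return (
        unicodedata.category(ch).startswith("M")
        or ch == ZWJ
        or 0x1F3FB <= ord(ch) <= 0x1F3FF
    )


def rough_grapheme_count(text: str) -> int:
    """Approximate count of user-perceived characters (NOT UAX #29)."""
    count = 0
    prev = ""
    ri_run = 0  # consecutive regional indicators seen so far
    for ch in text:
        if is_regional_indicator(ch):
            ri_run += 1
            # Regional indicators pair up: every odd one starts a new flag.
            starts_new = ri_run % 2 == 1
        else:
            ri_run = 0
            # A character joins the previous one if it is an extender
            # or if the previous character was a ZWJ.
            starts_new = prev == "" or not (is_extender(ch) or prev == ZWJ)
        if starts_new:
            count += 1
        prev = ch
    return count
```

Counting code points and counting with a function like this diverge on exactly the cases argued about in this thread, which is why a limit enforced on code points treats scripts unevenly.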

Ironically, this cuts both ways. Existing folks who didn't care about this now care about it due to emoji. Yay. But this leads to a secondary effect where the idea that this is "just" an emoji issue is spread around, and people who would have cared about it if they knew it affected languages as well may decide to ignore it because it's "just" emoji. It pays to be clear in these situations. I'm happy that emoji exist so that programmers are finally getting bonked on the head for not doing unicode right. At the same time, I'm wary of the situation getting misinterpreted and just shifting into another local extremum of wrongness.

Another contributor to this is that programmers love to hate Unicode. It has its flaws, but in general folks attribute any trouble they have to "oh that's just unicode being broken" even though these are often fundamental issues with international text. This leads to folks ignoring things because it's "unicode's fault", even though it's not.

I actually prefer the ZWJ approach to just randomly combining multiple symbols into one, like with the country flags. With ZWJ you at least have a chance to reliably detect such a combined grapheme, as opposed to keeping long lists of special cases that might or might not be implemented by your OS / rendering engine / font.

Man, imagine if you could compose Chinese characters out of radicals like this.

I'm not sure if that would be a good thing or a bad thing.

The one that still surprises me is Hangul (Korean script). Hangul characters are made of 24 basic characters (jamo) which represent consonant and vowel sounds, which are composed into Hangul characters representing syllables.

Unicode has a block for Hangul jamo, but they aren't used in typical text. Instead, Hangul are presented using a massive 11K-codepoint block of every possible precomposed syllable. ¯\_(ツ)_/¯

I believe that was a necessary compromise to use Hangul on any software not authored by Koreans.

"These are characters from a country you've never been to. Each three-byte sequence (assuming UTF-8) corresponds to a square-shaped character." --> Easy for everyone to understand, and less chance of screwup (as long as the software supports any Unicode at all).

"These should be decomposed into sequences of two or three characters, each three bytes long, and then you need a special algorithm to combine them into a square block." --> This pretty much means the software must be developed with Korean users in mind (or someone must heroically go through every part of the code dealing with displaying text), otherwise we might as well assume that it's English-only.

Well, now the equation might be different, as more and more software is developed by global companies and there are more customers using scripts with complicated combining diacritics, but that wasn't the case when Hangul was added to Unicode.

For example: if NFD works properly, the first two characters below should look identical, and the third should show a "defective" character that looks like the first two except without the circle (ㅇ). It doesn't work in gvim (it fails to consider the second/third example as a single character), Chrome in Linux, or Firefox in Linux.

은 은 ᅟᅳᆫ

Of course, if it were the only method of encoding Korean, then the support would have been better, but it would've still required a lot of work by everyone.

My Linux Chrome shows your example perfectly though. Note that I also have the CJK language pack installed.

My Linux Firefox also behaves correctly, but I don't have any language packs installed AFAIA.

Windows 10 latest chrome shows it properly, fwiw.

The original version of Unicode was primarily intended to unify all existing character sets as opposed to designing a character database from fundamental writing script principles. That's why most of the Latin accented characters (e.g., à) come in precomposed form.

It is worth noting that precomposed Hangul syllables decompose to the Jamo characters under NFD (and vice versa for NFC). However, most data is sent and used with NFC normalization.
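This round trip is easy to see with Python's `unicodedata` module. The sample syllable 은 (U+C740) is the one used elsewhere in this thread; the variable names are just for illustration.

```python
import unicodedata

syllable = "\uC740"  # 은, a precomposed Hangul syllable

# NFD splits the syllable into its conjoining jamo.
decomposed = unicodedata.normalize("NFD", syllable)
print([f"U+{ord(c):04X}" for c in decomposed])

# NFC puts the jamo back together into the precomposed syllable.
recomposed = unicodedata.normalize("NFC", decomposed)
print(recomposed == syllable)
```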

This is primarily because the legacy character set---KS X 1001---already contained tons (2,350 to be exact) of precomposed syllables. Unicode 1.0 and 1.1 had lots of syllables encoded in this way, with no good way to figure out the pattern, and in 2.0 the entire Hangul syllable block was reallocated to a single block of 11,172 correctly [1] ordered syllables.

So yeah, Unicode is not a problem here (the compatibility with existing character sets was essential for Unicode's success), it's a problem of legacy character sets :-)

[1] Only correct for South Koreans though :) but the pattern is now very regular and it's much more efficient than heavy table lookups.
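To make "very regular" concrete: with 19 leading consonants, 21 vowels, and 28 trailing slots (27 consonants plus "none"), a syllable's code point can be computed arithmetically, and 19 × 21 × 28 = 11,172. The sketch below uses the base constants from the Unicode Standard's Hangul composition description as I remember them; the function name `compose` is my own.

```python
S_BASE = 0xAC00  # first precomposed syllable
L_BASE = 0x1100  # first leading consonant (choseong)
V_BASE = 0x1161  # first vowel (jungseong)
T_BASE = 0x11A7  # one before the first trailing consonant (jongseong)
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28


def compose(lead: str, vowel: str, trail: str = "") -> str:
    """Compose conjoining jamo into a precomposed syllable by arithmetic."""
    l = ord(lead) - L_BASE
    v = ord(vowel) - V_BASE
    t = ord(trail) - T_BASE if trail else 0  # 0 means no trailing consonant
    if not (0 <= l < L_COUNT and 0 <= v < V_COUNT and 0 <= t < T_COUNT):
        raise ValueError("not a modern conjoining jamo sequence")
    return chr(S_BASE + (l * V_COUNT + v) * T_COUNT + t)


print(compose("\u110B", "\u1173", "\u11AB"))  # the three jamo of 은
```

No lookup table is needed, only the three index offsets, which is presumably the efficiency being referred to.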

I would imagine this is a legacy from the Good Old Days when every Asian locale had its own encoding. Unicode imported the Hangul block from ISO-2022-KR/Windows-949 (different encodings of the same charset), which has only Hangul syllables.

The ideographic description characters do provide a way to describe how to map radicals into characters, but don't actually provide rendering in such a manner.

There is active discussion on actually being able to build up complex grapheme clusters in such a manner, because it's necessary for Egyptian and Mayan text to be displayed properly. U+13430 and U+13431 have been accepted for Unicode 10.0 already for some Egyptian quadrat construction.

Doesn't it already exist to an extent? That's pretty much how Korean is built, isn't it?

Korean doesn't use IDSes, it's a fixed algorithm (not specced by unicode, but a fixed algorithm) for combining jamos into a syllable block. Korean syllable blocks are made up of a fixed set of components.

IDSes let you basically do arbitrary table layout with arbitrary CJK ideographs, which is very very different. With Hangul I can say "display these three jamos in a syllable block", and I have no control over how they get placed in the block -- I just rely on the fact that there's basically one way to do it (for modern korean, archaic text is a bit more complicated and idk how it's done) and the font will do it that way.

With IDS I can say "okay, display these two glyphs side-by-side, place them under this third glyph, place this aggregate next to another aggregate made up of two side-by-side glyphs, and surround this resulting aggregate with this glyph". Well, I can't, because I can't say the word display there; IDS is for describing chars that can't be encoded, but isn't supposed to really be rendered. But it could be, and that's a vastly different thing from what existing scripts like Hangul and Indic scripts let you do when it comes to glyph-combining.

Jamo, Emoji (including flag combinators), Arabic, and Indic scripts all combine on an effectively per-character basis. There's not really any existing character that says "display any Unicode grapheme A and grapheme B in the same visual cell with A above B." The proposed additions to Egyptian hieroglyphs would be the first addition of such a generic positioning control character to my knowledge, albeit perhaps limited just to characters in the Egyptian Unicode repertoire.

Research on what to do vis à vis Mayan characters (including perhaps reusing Egyptian control characters for layout) is still ongoing, as is better handling of Egyptian.

https://en.wikipedia.org/wiki/Cangjie_input_method

This isn't at the textual level, and the components are not strictly radicals, but this may interest you.

Somebody involved with Unicode must have had the same idea, because the ideographic description characters exist. However, I've never seen them used in practice because they don't actually render the character. You just get something like ⿰扌足, which corresponds to 捉.
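A short check of what that sequence actually is at the code point level, using Python's `unicodedata`. It only confirms that the description stays three separate characters and that normalization does not turn it into the encoded ideograph; it says nothing about rendering.

```python
import unicodedata

ids = "\u2FF0" + "扌足"  # description operator followed by two components
target = "捉"

for ch in ids:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))

# The description is not canonically equivalent to the encoded character.
is_equivalent = unicodedata.normalize("NFC", ids) == target
print(len(ids), is_equivalent)
```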

https://en.wikipedia.org/wiki/Ideographic_Description_Charac...

They're not supposed to render, it's purely for describing text.

As are the interlinear ruby annotations.

I'm curious. Are interlinear ruby annotation codepoints actually used for their intended purpose anywhere? And what are you supposed to do when you encounter one?

I know that they appear on Unicode's shitlist in [UTR#20], a proposed tech report that contained a table of codepoints that should not be used in text meant for public consumption. UTR#20 suggested things you could do when you encounter these codepoints, but it was withdrawn, leaving the status of these codepoints rather confused.

[UTR#20]: http://www.unicode.org/reports/tr20/tr20-9.html#Interlinear

You could just read PropList.txt to find the list of characters with the Deprecated property:

    0149          ; Deprecated # L&       LATIN SMALL LETTER N PRECEDED BY APOSTROPHE
    0673          ; Deprecated # Lo       ARABIC LETTER ALEF WITH WAVY HAMZA BELOW
    0F77          ; Deprecated # Mn       TIBETAN VOWEL SIGN VOCALIC RR
    0F79          ; Deprecated # Mn       TIBETAN VOWEL SIGN VOCALIC LL
    17A3..17A4    ; Deprecated # Lo   [2] KHMER INDEPENDENT VOWEL QAQ..KHMER INDEPENDENT VOWEL QAA
    206A..206F    ; Deprecated # Cf   [6] INHIBIT SYMMETRIC SWAPPING..NOMINAL DIGIT SHAPES
    2329          ; Deprecated # Ps       LEFT-POINTING ANGLE BRACKET
    232A          ; Deprecated # Pe       RIGHT-POINTING ANGLE BRACKET
    E0001         ; Deprecated # Cf       LANGUAGE TAG
(note that the ruby annotation codepoints aren't on that list).
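If you want to do that lookup in code, the file format is simple enough to parse by hand. This is a minimal sketch that assumes the line layout shown above (a code point or `start..end` range, a semicolon, then the property name); `SAMPLE` holds a few of those lines rather than the real file, and the function names are made up.

```python
import re

# A few lines copied from the excerpt above, standing in for the real file.
SAMPLE = """\
0149          ; Deprecated # L&       LATIN SMALL LETTER N PRECEDED BY APOSTROPHE
17A3..17A4    ; Deprecated # Lo   [2] KHMER INDEPENDENT VOWEL QAQ..KHMER INDEPENDENT VOWEL QAA
206A..206F    ; Deprecated # Cf   [6] INHIBIT SYMMETRIC SWAPPING..NOMINAL DIGIT SHAPES
E0001         ; Deprecated # Cf       LANGUAGE TAG
"""

# start, optional "..end", then the property name after the semicolon
LINE = re.compile(r"^([0-9A-F]+)(?:\.\.([0-9A-F]+))?\s*;\s*(\w+)")


def ranges_with_property(lines, prop):
    """Return (start, end) code point ranges that carry the given property."""
    ranges = []
    for line in lines:
        m = LINE.match(line.strip())
        if m and m.group(3) == prop:
            start = int(m.group(1), 16)
            end = int(m.group(2) or m.group(1), 16)
            ranges.append((start, end))
    return ranges


def has_property(code_point, ranges):
    return any(start <= code_point <= end for start, end in ranges)


deprecated = ranges_with_property(SAMPLE.splitlines(), "Deprecated")
print(has_property(0x206C, deprecated))  # inside 206A..206F
print(has_property(0xFFF9, deprecated))  # interlinear annotation anchor
```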

The use in XML/HTML is no longer maintained by Unicode, it is maintained by the W3C instead: https://www.w3.org/TR/unicode-xml/.

> Are interlinear ruby annotation codepoints actually used for their intended purpose anywhere?

Yes

> And what are you supposed to do when you encounter one?

Nothing. Don't display them, or display some symbolic representation. You probably shouldn't make ruby happen here; if your text is intended to be rendered correctly use a markup language.

----------

Unicode is ultimately a system for describing text. Not all stored text is intended to be rendered. This is why it has things like lacuna characters and other things.

So when you come across some text using ruby, or some text with an unencodable glyph, what do you do? You use ruby annotations or IDS respectively. It lets you preserve the nature of the text without losing info.

(Ruby is inside unicode instead of being completely deferred to markup since it is used often enough in Japanese text, especially whenever an irregular kanji (one not on the "common" list) is used. You're supposed to use markup if you actually want it rendered, but if you just wanted to store the text of a manuscript you can use ruby annotations)

Can you give an example of text in the wild that uses interlinear ruby annotation codepoints? Because I searched the Common Crawl for them, and every occurrence of U+FFF9 through U+FFFB seems to have been an accident that has nothing to do with Japanese.

Note that I didn't actually ask you about rendering.

I care from the point of view of the base level of natural language processing. Some decisions that have nothing to do with rendering are:

- Do they count as graphemes?

- What do you do when you feed text containing ruby characters to a Japanese word segmenter (which is not going to be okay with crazy Unicode control characters, even those intended for Japanese)?

- Could they appear in the middle of a phrase you would reasonably search for? Should that phrase then be searchable without the ruby? Should the contents of the ruby also be searchable?

Seeing how ruby codepoints are actually used would help to decide how to process them. But as far as I can tell, they're not actually used (markup is used instead, quite reasonably). So I'm surprised that your answer is a flat "Yes".
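For what it's worth, one possible policy for the search question is to separate the base text from the annotation before indexing. The sketch below assumes well-formed, non-nested sequences of the form anchor (U+FFF9), base text, separator (U+FFFA), annotation, terminator (U+FFFB); `strip_ruby` is a hypothetical helper, and the Japanese sample is invented, not text found in the wild.

```python
import re

ANCHOR, SEPARATOR, TERMINATOR = "\uFFF9", "\uFFFA", "\uFFFB"

# anchor, base text, separator, annotation text, terminator (no nesting)
ANNOTATED = re.compile(
    f"{ANCHOR}([^{ANCHOR}{SEPARATOR}{TERMINATOR}]*)"
    f"{SEPARATOR}([^{ANCHOR}{TERMINATOR}]*){TERMINATOR}"
)


def strip_ruby(text: str):
    """Return (text with only the base kept, list of (base, annotation))."""
    pairs = [(m.group(1), m.group(2)) for m in ANNOTATED.finditer(text)]
    base_only = ANNOTATED.sub(lambda m: m.group(1), text)
    return base_only, pairs


sample = "\uFFF9漢字\uFFFAかんじ\uFFFBを読む"
base_only, pairs = strip_ruby(sample)
print(base_only)
print(pairs)
```

With this split, the base text can go to a word segmenter without the control characters, and the annotations can be indexed separately if they should be searchable too.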

> woman kissing woman - 7 codepoints, woman + ZWJ + heart + ZWJ + kissy lips + ZWJ + woman

LOL, I knew about the crazy flag characters, but I had no idea it was this bad. Does "woman + ZWJ + heart + ZWJ + hands pressed together + ZWJ + woman" become "2 women lovingly holding hands"? Unicode has become completely absurd, and I am grateful every day that I'm not one of the poor coders having to implement it.

Humorously enough, input methods have not advanced at all. To type any of these things, I need to open a character picker, then either type in the character's name if I know it, or scroll through pages of symbols until I find the one I want. Yet we still call this "text."

>"edit: looks like HN strips emoji? Changed the emoji in the example into English words. They are all supposed to render as a single "character"."

Ironic huh, that we're discussing this on a site which has "solved" this by just pretending it doesn't exist :)

We're still on that IPv4 unicode, that Python 2.7 world, where we use scanf() to read user input and goto wherever we please.

That's all well and good, but at the end of the day, some unknowing developer has to write this functionality into whatever input-related code for some program that doesn't use OS-level components, and it just creates a mess.

What frustrates you about it? I understand the complexity of compositions, but do you have any interesting stories to share about the complexity causing problems for you?

> It's like composing Mayan pictographs

Which reminds me: We need a Jaguar emoji!