by amelius 676 days ago
Unicode sucks. There is a glyph for a hammer, for a screwdriver, but not a soldering iron.

During COVID, people were using a golf club as a substitute for a cotton swab.

We now have generative AI that can make any desired emoticon you can dream of, except you can't use it because of Unicode.

3 comments

Unicode is OK (flawed but OK), emoji in Unicode suck.

The worst part is Unicode breaking existing documents by retroactively switching some common symbols to emoji presentation by default, despite supposed stability guarantees.

The second worst part is the emoji combining sequences becoming an ad hoc, informally-specified, bug-ridden, slow implementation of half of Common Lisp.
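For anyone who hasn't run into the presentation issue: a rough sketch of how text versus emoji presentation can be requested explicitly with variation selectors. The choice of U+2764 is only an illustration (the comment above doesn't say which symbols it means), and how the bare codepoint is drawn depends on the platform and font.

```python
import unicodedata

HEART = "\u2764"  # HEAVY BLACK HEART, picked here purely as an example symbol
VS15 = "\uFE0E"   # VARIATION SELECTOR-15: requests text presentation
VS16 = "\uFE0F"   # VARIATION SELECTOR-16: requests emoji presentation

text_style = HEART + VS15
emoji_style = HEART + VS16

# Same base codepoint in all three strings; only the trailing selector differs.
for label, s in [("bare", HEART), ("text", text_style), ("emoji", emoji_style)]:
    codepoints = [f"U+{ord(c):04X}" for c in s]
    names = [unicodedata.name(c) for c in s]
    print(label, codepoints, names)
```

A document that stored only the bare codepoint leaves the choice to the renderer's default, which is where the complaint comes from.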

> The second worst part is the emoji combining sequences [...]

That was the main reason for me to pull in harfbuzz. At least it "just resolves" those sequences to glyph indices.
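To make the "combining sequences" concrete, here is a small sketch of what a ZWJ sequence looks like at the codepoint level. It does not call HarfBuzz; it only shows the input a shaping engine has to resolve into (ideally) a single glyph. The `describe` helper is made up for this example.

```python
import unicodedata

ZWJ = "\u200D"  # ZERO WIDTH JOINER
# Man + ZWJ + woman + ZWJ + girl: one "family" emoji where the font supports it.
family = "\U0001F468" + ZWJ + "\U0001F469" + ZWJ + "\U0001F467"

def describe(s):
    """Return (codepoint, name) pairs for every codepoint in s."""
    return [(f"U+{ord(c):04X}", unicodedata.name(c, "<unnamed>")) for c in s]

print(len(family))                  # codepoints, not visible glyphs
print(len(family.encode("utf-8")))  # bytes on the wire
for cp, name in describe(family):
    print(cp, name)
```

If the font has no glyph for the whole sequence, renderers typically fall back to drawing the individual people side by side.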

The goal of Unicode is to ascribe semantic, machine-readable/indexable identifiers (codepoints) to “things that appear in what are conventionally considered plaintext documents”, so that “text” can be handled in a standard way across systems like screen readers, IMEs, LLMs, search engines, etc. It also means we don’t need to depend on some particular pictorial representation, or on the decoding of an image, surviving into the future in order to read that text[1]: the (open, widely replicated) Unicode standard and database files encode a description of each codepoint and its semantics (things like collation, joining, capitalization, etc.).

[1] Consider if your vector image representing a soldering iron is someone’s IP, and they rescind licensing for use/redistribution of it. Poof goes all the (legal) copies of your emoji, leaving future historians scratching their heads about what was supposed to be there in the text, and what meaning it contributed. (A concrete case of this actually happening — though not with vector images, but rather stock sound effects: https://roblox.fandom.com/wiki/Roblox_death_sound)
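A quick sketch of what "a description and a semantics for each codepoint" looks like in practice, using only Python's bundled copy of the Unicode database (its version depends on the Python release, so treat the exact coverage as an assumption):

```python
import unicodedata

hammer = "\U0001F528"

# The description survives even if no font or image for the glyph does.
print(unicodedata.name(hammer))      # HAMMER
print(unicodedata.category(hammer))  # So (Symbol, other)

# Semantics that are defined per codepoint rather than per picture:
sharp_s = "\u00DF"                   # LATIN SMALL LETTER SHARP S
print(sharp_s.upper())               # capitalization: becomes "SS"

composed = unicodedata.normalize("NFC", "e\u0301")  # e + COMBINING ACUTE ACCENT
print(composed == "\u00E9")          # joins into one precomposed codepoint
```

Collation isn't shown because the standard library doesn't ship the Unicode collation tables.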

What the GP is complaining about isn't Unicode in general but Unicode Emoticons[1]. Characters can represent any thought, but emojis represent only what the committee agrees to, and the committee seems quite political.

What they should have done is to just have two characters <START_EMOTICON> and <END_EMOTICON>. And you could have text like:

    I'm not doing it <START_EMOTICON>pouting<END_EMOTICON>

If the renderer supports the "pouting" emoji, it would replace the text, and if it doesn't, it would just render:

    I'm not doing it *pouting*

Everyone would be free to create emojis. You could pick your own emoji provider. So if the emoji doesn't exist locally it would be fetched from `http://emojiprovider.tld/pouting`. If you don't like it you can install another one.
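A minimal sketch of the fallback rendering described above. Everything here is hypothetical: the two private-use codepoints stand in for the proposed <START_EMOTICON> and <END_EMOTICON> characters, `render` and the `installed` mapping are invented for the example, and fetching from a provider is left out entirely.

```python
import re

# Hypothetical delimiters: private-use codepoints standing in for the
# proposed <START_EMOTICON> and <END_EMOTICON> characters.
START = "\uE000"
END = "\uE001"

_PATTERN = re.compile(re.escape(START) + r"(.*?)" + re.escape(END))

def render(text, installed):
    """Swap each delimited name for an installed image, else fall back to *name*."""
    def substitute(match):
        name = match.group(1)
        return installed.get(name, f"*{name}*")
    return _PATTERN.sub(substitute, text)

message = f"I'm not doing it {START}pouting{END}"

print(render(message, {}))                                # no provider installed
print(render(message, {"pouting": "[img:pouting.svg]"}))  # provider knows the name
```

The `[img:...]` string is just a placeholder for whatever a real renderer would draw.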

It's ridiculous that there's no "pouting" emoji but there are 6 emojis for pregnant men.

[1]: Yes, they're called emoticons in Unicode, not emojis. The term emoji entered English later.

Ya, let's integrate GPU-accelerated emoji generation into Unicode and require it everywhere. We can call it UTF-8B and standardize on 8GiB per character until that turns out not to be enough.

UTF-8 is innocent here, it's just a (very clever and useful) encoding of Unicode. The problem is adding junk codepoints based on current political ideology, not how they are encoded.

If you use a consortium-governed character set, you get the problems of government by consortium. Probably still better to have a versioned universal set than so many to choose from.

For the record, I like UTF-8, I was just being silly.

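To illustrate the codepoint-versus-encoding distinction from this sub-thread, a small sketch: the same codepoint (U+1F528, the hammer) serialized two ways. Which codepoints exist is a decision of the standard; UTF-8 and UTF-16 only decide how the number is written down as bytes.

```python
hammer = "\U0001F528"  # one codepoint

utf8 = hammer.encode("utf-8")
utf16 = hammer.encode("utf-16-be")

print(f"U+{ord(hammer):04X}")  # U+1F528
print(utf8.hex())              # f09f94a8 (four bytes)
print(utf16.hex())             # d83ddd28 (a surrogate pair)

# Decoding either byte string gives back the identical codepoint.
print(utf8.decode("utf-8") == utf16.decode("utf-16-be") == hammer)
```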
> junk codepoints based on current political ideology

What do you mean?

Maybe the ability to turn skin color brown? Or maybe the poster is mad that some country flag was or wasn't included? The only other distant possibility is how some vendors' emoji fonts draw the gun emoji as a water pistol, but that's not on the Unicode consortium. I can't think of anything else that is political in the set.

I don't want to supply munitions to the culture war, but I would like to add a counterpoint.

Adding skin tone modifiers to emojis was a bit odd to me, since I view them as signifying emotions rather than people. Maybe that's why, of the six Fitzpatrick scale[0] skin tones they drew from, the only one not added was mine.

Similarly odd to me (to the point of appearing performative) was having a male/female/ambiguous variant on every job emoji and a separate option for every two-child two-parent family "gender" permutation. That's not how I view language as working, particularly because you're not going to be able to cover all valid families that way. It makes more sense to me, if communication rather than tokenism is the goal, to have a couple of representative emojis that convey the general concept, and then specify whatever you want about the relevant people afterward.

[0] https://en.wikipedia.org/wiki/Fitzpatrick_scale
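For reference, a short sketch of how the skin tone modifiers mentioned above are encoded: each is its own codepoint that follows a base emoji. The waving hand is just an example base; the names come from Python's bundled Unicode database.

```python
import unicodedata

WAVE = "\U0001F44B"  # WAVING HAND SIGN, used here as an example base emoji

# The modifiers occupy the contiguous range U+1F3FB..U+1F3FF.
MODIFIERS = [chr(cp) for cp in range(0x1F3FB, 0x1F400)]

for m in MODIFIERS:
    modified = WAVE + m  # base followed by modifier: two codepoints, one glyph
    print(f"U+{ord(m):04X}", unicodedata.name(m), len(modified))
```

The names show five modifiers for the six Fitzpatrick types: types 1 and 2 share a single modifier (TYPE-1-2).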

> you're not going to be able to cover all valid families that way

In fact the emoji committee backpedaled on family permutations for exactly this reason, and now recommends (exactly as you suggest) "symbolic" family glyphs and juxtaposition of existing people-emoji to describe families in detail.

You can read about it here: https://www.unicode.org/L2/L2022/22276-family-emoji-guidelin...