| HN Mirror



	by roland35 717 days ago
	Are there any libraries in place which can normalize all emojis down to a single symbol?

2 comments

kevindamm 717 days ago

It's a design decision. On one end, if I'm reading your question correctly, you could substitute U+FFFD (the replacement character) for anything not recognized as a language-specific character in the BMP and SMPs. This can be done with practically any existing Unicode library by filtering on character class. However, it will inadvertently filter some non-emoji symbols, and it doesn't really convey any information. It can even look unprofessional; it reminds me a lot of the early web during the pre-Unicode growing pains of poorly implemented i18n/l10n.
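As a rough illustration of filtering on character class, here is a minimal Python sketch using the standard `unicodedata` module. The set of "kept" classes and the helper name `replace_non_text` are my own assumptions about what counts as language-specific text, not a standard definition.

```python
import unicodedata

REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER

# Assumption: "language-specific" is approximated by these major general
# categories: Letter, Mark, Number, Punctuation, Separator.
KEEP_MAJOR_CLASSES = {"L", "M", "N", "P", "Z"}


def replace_non_text(text: str) -> str:
    """Replace every character outside the kept classes with U+FFFD."""
    out = []
    for ch in text:
        # ASCII is always kept so that "+", "=", "$" and newlines survive.
        if ord(ch) < 0x80 or unicodedata.category(ch)[0] in KEEP_MAJOR_CLASSES:
            out.append(ch)
        else:
            out.append(REPLACEMENT)
    return "".join(out)
```

Note that this also replaces non-emoji symbols such as the copyright sign (category `So`), which is exactly the collateral damage described above. Compound emoji come out as several replacement characters, because each component and each joiner is handled separately.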

There are libraries like Unidecode[0py] [0go] [0js] that convert Unicode to ASCII text, which might be the easiest to include in a TUI. All the ones I looked at convert emoji to `[?]`, but many other characters are converted to that too, including unknowns.

On the other end, you can keep a running list of what you mean by emoji[1], pattern match on those characters, and then substitute a representative emoji. That still leaves some difficulty around which symbol to choose as the representative and how to make it fit nicely within a TUI. An example of a library for pattern-matching on emoji is emoji-test-regex-pattern[2], but you can see it is based on a txt file that needs to be updated to keep up with additions to Unicode.
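A minimal sketch of the "running list plus substitution" approach in Python, using only the standard `re` module. The ranges are a hand-picked, deliberately incomplete assumption (flags and keycaps are not covered, and some matched code points are arguably not emoji). The name `normalize_emoji` and the default representative symbol are hypothetical choices.

```python
import re

# Assumption: a small hand-maintained list of blocks treated as "emoji".
_BASE = (
    "["
    "\U0001F300-\U0001F5FF"  # Miscellaneous Symbols and Pictographs
    "\U0001F600-\U0001F64F"  # Emoticons
    "\U0001F680-\U0001F6FF"  # Transport and Map Symbols
    "\U0001F900-\U0001F9FF"  # Supplemental Symbols and Pictographs
    "\u2600-\u26FF"          # Miscellaneous Symbols
    "\u2700-\u27BF"          # Dingbats
    "]"
)
# Variation selector-16 and skin tone modifiers that may follow a base.
_MOD = "[\uFE0F\U0001F3FB-\U0001F3FF]*"

# One base (plus modifiers), optionally extended by ZWJ-joined bases,
# so a compound emoji such as "family" counts as a single match.
EMOJI = re.compile(f"{_BASE}{_MOD}(?:\u200D{_BASE}{_MOD})*")


def normalize_emoji(text: str, representative: str = "\U0001F642") -> str:
    """Replace each matched emoji sequence with one representative symbol."""
    return EMOJI.sub(representative, text)
```

Because the representative (U+1F642) is itself inside the matched ranges, running the function twice gives the same result as running it once. Passing an ASCII placeholder as `representative` may be a better fit for a TUI.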

[0py]: https://github.com/avian2/unidecode

[0go]: (actually there are a few of these) https://pkg.go.dev/github.com/gosimple/unidecode

[0js]: https://github.com/xen0n/jsunidecode

[1]: these aren't really contiguous ranges, and opinions vary, see https://en.m.wikipedia.org/wiki/Emoji#Unicode_blocks

[2]: https://github.com/mathiasbynens/emoji-test-regex-pattern


estebank 717 days ago

There's a "trick" that works somewhat well for some compound emoji like "family": replace ZWJs with whitespace. Emoji width is not standardized because it depends on the platform, available fonts, shell and terminal emulator, but almost no terminal supports compound emoji correctly. Because of how they were designed, most terminals will print a compound emoji as its component parts. When we need to do something like underline a piece of text (as rustc has to), we decompose them ourselves, and then knowing the width of a char becomes a more tractable problem: 0, 1 or 2, plus variable width for tabs, which we just transform to a hardcoded 4 (incorrect but usable). This can still be incorrect on specific terminals, but it works well enough on most.
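To make the trick concrete, here is a hedged Python sketch using the standard `unicodedata` module. It is not rustc's actual code; the width rules (0 for combining marks, format and control characters, 2 for East Asian Wide/Fullwidth, 1 otherwise, 4 for tabs) and the function names are my assumptions about a reasonable approximation.

```python
import unicodedata

ZWJ = "\u200d"   # ZERO WIDTH JOINER
TAB_WIDTH = 4    # hardcoded tab width: incorrect but usable


def split_compound_emoji(text: str) -> str:
    """Replace each ZWJ with a space so components render separately."""
    return text.replace(ZWJ, " ")


def char_width(ch: str) -> int:
    """Estimate the terminal column width of a single character."""
    if ch == "\t":
        return TAB_WIDTH
    category = unicodedata.category(ch)
    # Control, format, and combining characters take no column.
    if category in ("Cc", "Cf", "Mn", "Me") or unicodedata.combining(ch):
        return 0
    # East Asian Wide and Fullwidth characters (including most emoji)
    # take two columns.
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2
    return 1


def display_width(text: str) -> int:
    """Estimate the column width of text after splitting compound emoji."""
    return sum(char_width(ch) for ch in split_compound_emoji(text))
```

With this, the three-person "family" sequence is measured as three wide emoji separated by two spaces. Whether a given terminal agrees with that estimate depends on its own width tables.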


Joker_vD 716 days ago

It doesn't matter; what matters is that both your (terminal-manipulating) program and the terminal emulator agree on the symbol widths. Considering that they usually won't (lots of terminal emulators have their own hand-crafted, statically linked wcwidth/wcswidth functions; the readline library also has them hard-coded, by the way), it's quite frustrating.


ku1ik 716 days ago

This.
