Hacker News
by socalgal2 399 days ago
Good or not, many languages choose not to support emoji, even newer ones.

The reasons are complicated, but many people prefer Unicode normalization, so that different forms of what appears to be the same word are treated as the same word. People argue about whether this is important, but it would certainly be frustrating to get an error like

    let café = 1;
    café += 1;  // error, unknown identifier 'café'
The error happens in non-normalizing languages because those two identifiers are not the same sequence of Unicode code points, even though they look identical.
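
To make that concrete, here is a minimal Python sketch of the two spellings. It assumes the first "café" was typed with the precomposed é (U+00E9) and the second with a plain e followed by a combining acute accent (U+0301). The variable names are mine, purely for illustration.

```python
import unicodedata

# Two spellings that render the same but are different code point sequences.
composed = "caf\u00e9"      # 'é' as the single code point U+00E9
decomposed = "cafe\u0301"   # 'e' followed by combining acute accent U+0301

# A non-normalizing comparison sees two different identifiers.
same_raw = composed == decomposed

# After NFC normalization both spellings collapse to the same string.
same_nfc = (unicodedata.normalize("NFC", composed)
            == unicodedata.normalize("NFC", decomposed))

print(same_raw, same_nfc)
print(len(composed), len(decomposed))
```

A normalizing language would do the equivalent of the NFC step on every identifier before looking it up.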

But choosing a normalization affects emoji as well. Worse, when new ones are added, the normalization rules can change.

2 comments

I had an idea the other day to use 64-bit character codes. Then each code can be interpreted as an 8x8 bitmap. Every character would be guaranteed to have a unique bitmap representation. The bitmaps wouldn't be used for rendering, of course -- but they could be used as a fallback if your font does not define a character. Anyway this would somewhat avoid the problem you describe because two characters that look the same visually would have the same value. Nothing I'll ever implement of course, just a thought experiment.
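
A rough sketch of what "code as bitmap" could look like, in Python. The bit layout (one byte per row, most significant byte on top, most significant bit on the left) and the example glyph value are assumptions made up for illustration; nothing here is an existing encoding.

```python
def render_bitmap(code: int) -> str:
    """Render a 64-bit code as 8 rows of 8 cells ('#' = set bit, '.' = clear)."""
    if not 0 <= code < 2**64:
        raise ValueError("code must fit in 64 bits")
    rows = []
    for row in range(8):
        # Assumed layout: the top row is the most significant byte.
        byte = (code >> (8 * (7 - row))) & 0xFF
        # Assumed layout: the leftmost pixel is the most significant bit.
        rows.append("".join("#" if byte & (0x80 >> col) else "."
                            for col in range(8)))
    return "\n".join(rows)


# Hypothetical code for a capital "T": a full top bar and a two-pixel stem.
GLYPH_T = 0xFF18181818181800

print(render_bitmap(GLYPH_T))
```

Two characters drawn with identical pixels would get the same 64-bit value under this scheme, which is the property the thought experiment relies on.
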
As someone who has worked with 8×8 fonts I can report that you'd have some surprising problems with that idea. Not only would you have problems with there being two distinct forms of letters like "a" and "g", making things over-unique; and not only is it tricky to differentiate forms that actually are not the same, because they are from two different alphabets (especially the 13 extra "mathematical" alphabets); but it's actually quite difficult to make pre-composed forms in that amount of space.

8×8 is a tight squeeze, and 16×16 works a lot better. But that would make your approach vastly more space hungry than a normalization approach using the actual Unicode code points.

* https://github.com/jdebp/unscii/tree/2.1.1f

The level of blind trust that English speakers put in non-ASCII character support always throws me off, knowing that a username on Windows 11 still has to be a short ASCII sequence. Sure, it's not 2010 anymore, and you only have to recreate the user account rather than clean-reinstall Windows, but still.

Non-ASCII comments in source code can be scary enough sometimes, unless it's for an all-Unicode system like Android or something HTML-based.