by saltminer 1600 days ago

> WTF business do emojis have in Unicode?

Unicode didn't invent emoji. The consortium incorporated them because they were already popular in Japan, and leaving them out would have greatly reduced Japanese adoption.

Keep in mind that Unicode was intended to unify all the disparate encodings that had been brewed up to support different languages and which made exchanging documents between non-English speaking countries a nightmare. The term "mojibake" comes to mind [0] - Japan alone had so many encodings that a slang term for text encoded with something different than what your device expected (and subsequently got rendered as nonsensical/garbled text) came about. And they weren't alone, of course [1].
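
A minimal Python sketch of why this happens, assuming only the standard-library codecs (the sample word and variable names are illustrative): the same text produces different bytes under each Japanese encoding, so reading it with the wrong codec garbles it.

```python
# The same Japanese word ("mojibake") in three legacy encodings and UTF-8.
text = "文字化け"

codecs_to_try = ("shift_jis", "euc_jp", "iso2022_jp", "utf-8")
encoded = {name: text.encode(name) for name in codecs_to_try}

for name, raw in encoded.items():
    # Every codec yields a different byte sequence for identical text.
    print(f"{name:>10}: {raw.hex(' ')}")

# Read Shift_JIS bytes with the EUC-JP codec. errors="replace" swaps
# undecodable bytes for U+FFFD instead of raising UnicodeDecodeError.
wrong = encoded["shift_jis"].decode("euc_jp", errors="replace")
print(wrong)  # garbled text, not the original word
```

Nothing in the bytes themselves says which codec produced them, which is why a missing or wrong label is enough to cause the problem.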

> What we need now is a standardized, sane subset of Unicode that implementations can support while rejecting the insane scope creep that got added on top of that.

Unicode wasn't intended to be pretty. It was intended to be the one system that everyone used, and a way to increase adoption was to do some less than ideal things, like duplicate characters (so it would be easier to convert to Unicode).

You may never need anything outside the BMP, but that doesn't make the rest of the planes worthless. Ignoring the value of including dead and nearing-extinct languages for preservation purposes (not being able to type a language will basically guarantee its extinction, with inventing a new encoding and storing text as jpgs being the only real alternatives), there are a lot of people speaking languages found in the SMP [2][3] ([2] has 83 million native speakers, for example).
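
For a concrete sense of what "outside the BMP" means at the encoding level, here is a small Python sketch. The helper names `plane` and `utf16_surrogates` are made up for illustration; U+11600 is the first code point of the Modi block linked in [2].

```python
import unicodedata
from typing import Tuple

def plane(cp: int) -> int:
    """Plane number of a code point: 0 is the BMP, 1 is the SMP."""
    return cp >> 16

def utf16_surrogates(cp: int) -> Tuple[int, int]:
    """Split a code point above U+FFFF into its UTF-16 surrogate pair."""
    if cp <= 0xFFFF:
        raise ValueError("BMP code points do not need surrogates")
    offset = cp - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)

modi_a = "\U00011600"  # first code point of the Modi block
cp = ord(modi_a)

# The name lookup depends on the Unicode version bundled with Python.
print(unicodedata.name(modi_a, "<not in this Python's Unicode database>"))
print(plane(cp))                               # 1, i.e. the SMP
print([hex(u) for u in utf16_surrogates(cp)])  # two 16-bit code units
print(len(modi_a.encode("utf-8")))             # 4 bytes in UTF-8
```

Software that assumes one 16-bit unit per character handles the BMP but breaks on any text like this.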

[0]: https://en.wikipedia.org/wiki/Mojibake

[1]: https://segfault.kiev.ua/cyrillic-encodings/

[2]: https://en.wikipedia.org/wiki/Modi_(Unicode_block)

[3]: https://en.wikipedia.org/wiki/Chakma_(Unicode_block)


lmm 1600 days ago

> The term "mojibake" comes to mind [0] - Japan alone had so many encodings that a slang term for text encoded with something different than what your device expected (and subsequently got rendered as nonsensical/garbled text) came about.

Mojibake was not a "Japan has too many encodings" problem. It was a "western developers assume everyone is using CP1252" problem.

> Unicode wasn't intended to be pretty. It was intended to be the one system that everyone used, and a way to increase adoption was to do some less than ideal things, like duplicate characters (so it would be easier to convert to Unicode).

Unfortunately they undermined all that with Han Unification, with the result that it's never going to be adopted in Japan.


fomine3 1600 days ago

Mojibake is a universal problem whenever multiple charsets are in use and the metadata carries no charset specification. Software guesses the charset, but it's just a guess. Japanese-locale software occasionally confuses Latin-1 with SJIS, but more often confuses SJIS with EUC-JP or UTF-8.
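
A rough Python sketch of such guessing, assuming a naive "first codec that decodes without error wins" strategy (the function `guess_decode` is hypothetical, and real detectors use more elaborate heuristics). It shows that the answer depends on the order of candidates:

```python
from typing import Iterable, Tuple

def guess_decode(raw: bytes, candidates: Iterable[str]) -> Tuple[str, str]:
    """Return (codec_name, text) for the first candidate that decodes cleanly."""
    for name in candidates:
        try:
            return name, raw.decode(name)  # strict mode raises on invalid bytes
        except UnicodeDecodeError:
            continue
    raise ValueError("no candidate codec could decode the input")

original = "日本語"
raw = original.encode("euc_jp")  # bytes arrive with no charset label

# UTF-8 rejects these bytes, so the guess falls through to EUC-JP.
good = guess_decode(raw, ["utf-8", "euc_jp"])

# Latin-1 accepts every byte sequence, so trying it first always "succeeds".
bad = guess_decode(raw, ["latin-1", "euc_jp"])

print(good[0], good[1] == original)
print(bad[0], bad[1] == original)
```

A successful decode only proves the bytes are valid in that codec, not that the codec is the right one.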

Unicode/UTF-8 is widely adopted and recommended in Japan, and there is no widely used alternative. Japanese companies still tend to use SJIS, but that's just laziness. Han unification isn't a problem when handling only Japanese text: just use a Japanese font everywhere. Handling multilingual text is a pain, but there are no alternatives anyway.


lmm 1600 days ago

> Mojibake is a universal problem whenever multiple charsets are in use and the metadata carries no charset specification. Software guesses the charset, but it's just a guess. Japanese-locale software occasionally confuses Latin-1 with SJIS, but more often confuses SJIS with EUC-JP or UTF-8.

In theory it can happen with any combination of character sets, sure, but in practice every example of mojibake I've seen has been SJIS (or UTF-8) encoded text being decoded as CP1252 ("Latin-1" but that's an ambiguous term) by software that assumed the whole world used that particular western codepage. If you've got examples of SJIS vs EUC-JP confusion in the wild I'd be vaguely interested to see them (is there even anywhere that still uses EUC-JP?)
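
The UTF-8-read-as-CP1252 case can be reproduced in a few lines of Python. This is a sketch using only the standard codecs, and the sample text is chosen so that every byte happens to be defined in CP1252:

```python
text = "日本語"

# Software that assumes CP1252 reads each UTF-8 byte as one western character.
garbled = text.encode("utf-8").decode("cp1252")
print(len(text), len(garbled))  # 3 characters become 9

# When every byte maps to a CP1252 character, the damage is reversible:
repaired = garbled.encode("cp1252").decode("utf-8")
print(repaired == text)
```

The repair is not always possible: CP1252 leaves bytes 0x81, 0x8D, 0x8F, 0x90 and 0x9D undefined, so text containing them either fails to decode or loses information.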

> Japanese companies still tend to use SJIS, but that's just laziness.

It's not just laziness; switching to unicode is a downgrade, because in practice it means you're going to get your characters rendered wrongly (Chinese style) for a certain percentage of your customers, for little clear benefit.

> Handling multilingual text is a pain, but there are no alternatives anyway.

Given that you have to build a structure that looks like a sequence of spans with language metadata attached to each one, there's not much benefit to using unicode versus letting each span specify its own encoding.


fomine3 1600 days ago

Maybe the guess order depends on the locale, which would be reasonable. What I described above is mainly my experience with old ja-JP localized Windows software. IIRC Unix software tends to be bad at guessing, so maybe that's what you're referring to.

Nowadays I rarely see new EUC-JP content (or I just don't recognize it), but I still sometimes hit mojibake in Chrome when visiting old homepages (about once a month). On the web, most modern pages (including SJIS ones) don't rely on guessing but have a <meta charset> tag, so mojibake happens very rarely. For plain-text files, I still see UTF-8 files shown as SJIS in Chrome on Windows.

Viewing Japanese-only UTF-8 text is totally fine on Japanese-localized Windows/Mac/(Linux, but YMMV). So your case is viewing the text under a non-Japanese locale. That may be a problem, but how does SJIS solve it? What software switches fonts when it opens an SJIS file? Does the app/format not support specifying a font/lang the way HTML/Word do?

I believe no developer wants to deal with foreign charsets like GBK/Big-5/whatever. There is very little information about them. If a developer can switch the charset used to read a file, then they can also switch the font.


lmm 1600 days ago

> Viewing Japanese-only UTF-8 text is totally fine on Japanese-localized Windows/Mac/(Linux, but YMMV). So your case is viewing the text under a non-Japanese locale.

The issue is that realistically a certain proportion of customers are going to have the wrong locale setting or wrong default font set.

> That may be a problem, but how does SJIS solve it? What software switches fonts when it opens an SJIS file? Does the app/format not support specifying a font/lang the way HTML/Word do?

Certainly Firefox will use a Japanese font by default for SJIS whereas it will use a generic (i.e. Chinese) font by default for UTF-8. I would expect most encoding-aware programs would do the same?

> If a developer can switch the charset used to read a file, then they can also switch the font.

Sure, but it works both ways. And it's actually much easier for a lazy developer to ignore the font case because it's essentially only an issue for Japan. Whereas if you make a completely encoding-unaware program it will cause issues in much of Europe and all of Asia (well, it did pre-UTF8 anyway).


numpad0 1600 days ago

I think by far the largest contributor to the term mojibake was the e-mail MTA. Some e-mail implementations assumed 7-bit ASCII for all text and dropped the MSB of 8-bit SJIS/Unicode/etc., leaving corrupt text at the receiving end. Next came text written in EUC (Extended UNIX Code)-JP, probably by someone running either a real Unix (likely Solaris) or early GNU/Linux, and floppies from classic Mac OS computers. Those must have defined the term, and various edge cases on the web, like header/encoding mismatches, popularized it.

The "Zhonghua fonts" issue is not necessarily linked to encoding; it's an issue of assuming or guessing locales, which has to be solved by adding a language identifier or by ending Han unification.

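
The MSB-dropping failure described for e-mail can be simulated in Python. This is a sketch in which `strip_high_bit` is a made-up stand-in for a 7-bit-only mail relay. The ISO-2022-JP comparison is an addition not mentioned in the comment: that encoding is 7-bit by design and so passes through unharmed.

```python
def strip_high_bit(raw: bytes) -> bytes:
    """Simulate a relay that keeps only the low 7 bits of every byte."""
    return bytes(b & 0x7F for b in raw)

original = "日本語"

sjis = original.encode("shift_jis")
damaged = strip_high_bit(sjis)
print(sjis.hex(" "))
print(damaged.hex(" "))  # same length, different bytes: the text is corrupted

# ISO-2022-JP uses escape sequences plus 7-bit bytes only,
# so the same relay leaves it untouched.
iso = original.encode("iso2022_jp")
print(strip_high_bit(iso) == iso)
```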

account42 1600 days ago

> Unfortunately they undermined all that with Han Unification, with the result that it's never going to be adopted in Japan.

This is an absolute shame, and there is no excuse for not fixing it, so that variations of unified characters can be encoded, before adding unimportant things like skin tones.


wodenokoto 1600 days ago

> So rather than treat the issue as a rich text problem of glyph alternates, Unicode added the concept of variation selectors, first introduced in version 3.2 and supplemented in version 4.0.[10] While variation selectors are treated as combining characters, they have no associated diacritic or mark. Instead, by combining with a base character, they signal the two character sequence selects a variation (typically in terms of grapheme, but also in terms of underlying meaning as in the case of a location name or other proper noun) of the base character. This then is not a selection of an alternate glyph, but the selection of a grapheme variation or a variation of the base abstract character. Such a two-character sequence however can be easily mapped to a separate single glyph in modern fonts. Since Unicode has assigned 256 separate variation selectors, it is capable of assigning 256 variations for any Han ideograph. Such variations can be specific to one language or another and enable the encoding of plain text that includes such grapheme variations. - https://en.m.wikipedia.org/wiki/Han_unification

This is what you’re asking for, right? Control characters that designate which version of a unified character is to be displayed.

Sure looks like it exists.
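
At the encoding level, a variation sequence is just a base character followed by a selector. Here is a small Python sketch; the helper names are hypothetical and the base ideograph U+845B is picked purely for illustration. Whether a given sequence actually renders differently depends on the font, which this sketch does not show.

```python
import unicodedata

def is_variation_selector(ch: str) -> bool:
    """True for VS1-VS16 (U+FE00..U+FE0F) and VS17-VS256 (U+E0100..U+E01EF)."""
    cp = ord(ch)
    return 0xFE00 <= cp <= 0xFE0F or 0xE0100 <= cp <= 0xE01EF

def strip_variation_selectors(text: str) -> str:
    """Drop all selectors, falling back to the plain unified characters."""
    return "".join(ch for ch in text if not is_variation_selector(ch))

base = "\u845b"      # a unified Han ideograph, chosen for illustration
vs17 = "\U000E0100"  # VARIATION SELECTOR-17
sequence = base + vs17

print(len(sequence))                      # two code points, one intended glyph
print(unicodedata.name(vs17))
print(sequence.encode("utf-8").hex(" "))  # 3 bytes for the base + 4 for the selector
print(strip_variation_selectors(sequence) == base)
```

The two ranges together give the 256 selectors the quoted passage mentions. Software that ignores them still sees the same base character, so the text degrades gracefully.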
