by fomine3
1601 days ago
Mojibake is a universal problem whenever multiple charsets are in use and the metadata doesn't specify which one applies. Software guesses the charset, but it's just a guess. Japanese-locale software occasionally confuses Latin-1 with SJIS, but more often confuses SJIS with EUC-JP or UTF-8. Unicode/UTF-8 is widely adopted and recommended in Japan, and there is no widely used alternative. Japanese company tend to still use SJIS but it's just laziness. Han unification isn't a problem when handling only Japanese text: just use a Japanese font everywhere. To handle multiple language text, it's pain but anyway there are no alternatives.
|
In theory it can happen with any combination of character sets, sure, but in practice every example of mojibake I've seen has been SJIS (or UTF-8) encoded text being decoded as CP1252 ("Latin-1" but that's an ambiguous term) by software that assumed the whole world used that particular western codepage. If you've got examples of SJIS vs EUC-JP confusion in the wild I'd be vaguely interested to see them (is there even anywhere that still uses EUC-JP?)
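For anyone who wants to reproduce the kinds of confusion being discussed, here is a minimal Python sketch. The helper name `mojibake` and the sample string are made up for illustration; the codec names are the ones shipped with Python's standard library, and real software may guess charsets quite differently.

```python
def mojibake(text: str, written_as: str, read_as: str) -> str:
    """Encode text with one charset, then decode the bytes with another."""
    raw = text.encode(written_as)
    # errors="replace" substitutes U+FFFD for bytes the wrong codec cannot map
    return raw.decode(read_as, errors="replace")


sample = "日本語"

# SJIS bytes read by software that assumes CP1252
print(mojibake(sample, "shift_jis", "cp1252"))  # “ú–{Œê

# EUC-JP bytes read as SJIS, and UTF-8 bytes read as SJIS
print(mojibake(sample, "euc_jp", "shift_jis"))
print(mojibake(sample, "utf-8", "shift_jis"))
```

The CP1252 case is reversible here because every byte happens to map to a CP1252 character; the other two cases can lose information wherever the wrong codec has to substitute U+FFFD.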
> Japanese company tend to still use SJIS but it's just laziness.
It's not just laziness; switching to Unicode is a downgrade, because in practice it means your characters will be rendered wrongly (Chinese style) for a certain percentage of your customers, for little clear benefit.
> To handle multiple language text, it's pain but anyway there are no alternatives.
Given that you have to build a structure that looks like a sequence of spans with language metadata attached to each one, there's not much benefit to using Unicode versus letting each span specify its own encoding.
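A rough sketch of what such a span structure could look like, in Python. The `Span` class, its field names, and `decode_spans` are hypothetical, not taken from any existing library; it only assumes each span carries a language tag and the name of a codec Python knows about.

```python
from dataclasses import dataclass


@dataclass
class Span:
    lang: str      # language tag used to pick a font, e.g. "ja" or "zh"
    encoding: str  # codec name for this span's bytes (assumed known to Python)
    data: bytes    # the raw text in that encoding


def decode_spans(spans):
    """Decode each span with its own declared encoding, keeping the language tag."""
    return [(span.lang, span.data.decode(span.encoding)) for span in spans]


document = [
    Span("ja", "shift_jis", "日本語".encode("shift_jis")),
    Span("zh", "gb2312", "中文".encode("gb2312")),
]

for lang, text in decode_spans(document):
    print(lang, text)
```

Note that the renderer still needs the `lang` field to choose glyphs in either design; the only thing that changes is whether `data` is always Unicode or varies per span.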