by lmm
1603 days ago
> Mojibake is a universal problem when multiple charset is used and there are no charset specification on metadata. Software guess charset but it's just a guess. Japanese locale software occasionally confuses Latin-1 vs SJIS but often confuses SJIS vs EUC-JP or UTF-8.
In theory it can happen with any combination of character sets, sure, but in practice every example of mojibake I've seen has been SJIS (or UTF-8) encoded text being decoded as CP1252 ("Latin-1" but that's an ambiguous term) by software that assumed the whole world used that particular western codepage. If you've got examples of SJIS vs EUC-JP confusion in the wild I'd be vaguely interested to see them (is there even anywhere that still uses EUC-JP?)
> Japanese company tend to still use SJIS but it's just laziness.
It's not just laziness; switching to unicode is a downgrade, because in practice it means you're going to get your characters rendered wrongly (Chinese style) for a certain percentage of your customers, for little clear benefit.
> To handle multiple language text, it's pain but anyway there are no alternatives.
Given that you have to build a structure that looks like a sequence of spans with language metadata attached to each one, there's not much benefit to using unicode versus letting each span specify its own encoding.
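For anyone who hasn't seen the SJIS-as-CP1252 case, here is a minimal Python sketch of it. The sample string is a hypothetical choice, and `errors="replace"` is only there because CP1252 leaves a few byte values undefined.

```python
# Hypothetical sample text; any Japanese string works here.
original = "日本語"

# The bytes as a Shift_JIS file or page would store them.
sjis_bytes = original.encode("shift_jis")

# Wrong guess: treat the Shift_JIS bytes as the western codepage CP1252.
# errors="replace" covers the few byte values CP1252 leaves undefined.
mojibake = sjis_bytes.decode("cp1252", errors="replace")

# Right guess: decode with the charset that was actually used.
recovered = sjis_bytes.decode("shift_jis")

print(repr(mojibake))         # one western character per byte
print(recovered == original)
```

Every byte turns into its own western character, so the garbled string is as long as the byte sequence rather than the original text.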
|
Nowadays I rarely see new EUC-JP content (or I just don't recognize it), but I still sometimes run into mojibake in Chrome when visiting old homepages (about once a month). On the web, most modern pages (including SJIS ones) don't rely on guessing but carry a <meta charset> tag, so mojibake happens very rarely. For plaintext files, I still see UTF-8 files displayed as SJIS in Chrome on Windows.
Viewing Japanese-only UTF-8 text is totally fine on Japanese-locale Windows/Mac (and Linux, but YMMV). So your case is viewing the text under a non-Japanese locale. That may well have a problem, but how does SJIS solve it? What software switches fonts when it opens an SJIS file? Does the app/format not support specifying font/lang the way HTML/Word do?
I believe no developer wants to deal with foreign charsets like GBK/Big-5/whatever. There is very little information about them. If a developer can switch the charset used to read a file, then they can also switch the font.
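To illustrate the difference between a declared charset and a guess, here is a rough Python sketch. The helper `declared_charset` and its regex are my own simplification, not what Chrome or the HTML spec's prescan actually does; the sample page and text are made up.

```python
import re

# Simplified pattern (an assumption): matches both <meta charset="...">
# and the older http-equiv form with "charset=" inside content="...".
META_CHARSET = re.compile(
    rb'<meta[^>]*charset\s*=\s*["\']?([A-Za-z0-9_\-]+)', re.IGNORECASE
)

def declared_charset(raw: bytes, default=None):
    """Return the charset named in a <meta> tag near the top, else default."""
    match = META_CHARSET.search(raw[:1024])
    return match.group(1).decode("ascii").lower() if match else default

# A page that declares its charset decodes correctly without guessing.
page = '<meta charset="Shift_JIS"><p>日本語</p>'.encode("shift_jis")
page_text = page.decode(declared_charset(page))

# A plaintext file has no declaration, so the viewer has to guess.
plain = "日本語".encode("utf-8")
wrong_guess = plain.decode("shift_jis", errors="replace")

print(page_text)
print(declared_charset(plain))   # None: nothing declared
print(repr(wrong_guess))         # UTF-8 bytes misread as SJIS
```

The plaintext case is the one that still goes wrong: with nothing declared, the result depends entirely on which charset the viewer tries first.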
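As a sketch of that last point: code that already knows which charset to decode with can pass the language along, so the renderer picks the right font. The mapping table and `to_html_span` below are hypothetical names for illustration, not from any real library.

```python
from html import escape

# Hypothetical mapping from legacy charset names to BCP 47 language tags.
CHARSET_TO_LANG = {
    "shift_jis": "ja",
    "euc-jp": "ja",
    "gbk": "zh-Hans",
    "big5": "zh-Hant",
}

def to_html_span(raw: bytes, charset: str) -> str:
    """Decode raw with the given charset and tag the result with a language."""
    text = raw.decode(charset)
    # "und" is the BCP 47 tag for an undetermined language.
    lang = CHARSET_TO_LANG.get(charset.lower(), "und")
    return f'<span lang="{lang}">{escape(text)}</span>'

# The same Unicode code point, tagged so each reader gets their own glyph style.
print(to_html_span("直".encode("shift_jis"), "shift_jis"))
print(to_html_span("直".encode("gbk"), "gbk"))
```

Once decoded, both spans hold the same character; only the `lang` attribute tells the renderer whether to draw it Japanese style or Chinese style.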