Hacker News new | ask | show | jobs
by zzo38computer 50 days ago
> A file isn't meaningful unless you know how to interpret it; that will always be true.

There are multiple levels of meaning, though; character encoding is just one part of it. For example, a text file might be plain text, or HTML, or JSON, or a C source code, etc; a binary file might be DER, or IFF, or ZIP, etc; and then there will be e.g. what kind of data a JSON or DER or IFF contains and how that level of the data is interpreted, etc.

> Cyrillic and Greek characters get two bytes, even when they are by definition identical to ASCII characters.

Whether or not they are identical to ASCII characters depends on the character set and on other things, such as what they are being used for; the definition of "identical" is not so simple as you make it seem. Unicode defines them as not identical, which is appropriate for some uses but is wrong for other uses. (Unicode also defines some characters as identical even though in some uses it would be more appropriate to treat them as not identical, too. So, Unicode is both ways bad.)

> This bloat is actually worse than the bloat you get by using UTF-8 for Japanese; Cyrillic and Greek will easily fit into one byte.

I agree with that (although I think UTF-8 should not be used for Japanese either), but it isn't because of which characters are considered "identical" or not. There are problems with Unicode in general regardless of which encoding you use.

1 comments

> ... (although I think UTF-8 should not be used for Japanese either) ...

The people putting up websites in Japanese disagree with you, it would seem. According to Wikipedia (in the Shift JIS article), as of March 2026 99% of websites in the .jp domain were in UTF-8, with only 1% being in Shift JIS.

Japan used to have two different encodings in common use, Shift JIS (usually used on Windows) and EUC-JP (more common on Unix servers). This resulted in characters being misinterpreted often enough that they coined the word mojibake to describe the phenomenon of text coming out completely garbled. These days, it seems Japanese website makers are more than happy to accept a slight inefficiency in encoding size, because what they gain from that is never having to see mojibake again.

If they are misinterpreted, it is because the character encoding is not declared properly.

I still sometimes see mojibake in Japanese web pages, but sometimes it works; if it works, it is because the character encoding is declared properly.

In my opinion, EUC-JP is a generally better encoding of JIS (especially in e.g. C source code, which should not use Shift-JIS but EUC-JP is OK), but Shift-JIS does have some benefits in some circumstances (such as making a character grid with one byte per character cell; if using Shift-JIS for a Pascal source code then you should use (* *) instead of { } for comments please).

> If they are misinterpreted, it is because the character encoding is not declared properly.

OR because the software is buggy, or making assumptions about encoding and not checking them (which also counts as "buggy", of course). You can declare the encoding all you like, it won't protect you against the stupid decisions that other people make in writing their software. (See Excel, for example).

Yes, if you declare your encoding properly, things should work. Most of the time. And if you're using any encoding that is not the worldwide default (which these days is UTF-8), then you definitely should declare the encoding. But you'll still occasionally hit badly-written software that doesn't even think about other encodings and doesn't handle them properly. The only defense against that situation, where you declare your encoding properly and it still doesn't work, is to just use the encoding that the software was written to expect, which is almost certainly the worldwide default.