| HN Mirror


by t_hozumi 4647 days ago

I think there is still a fundamental problem with string encoding.

The problem is that a decoder cannot know which encoding a byte stream was encoded in without additional information. That information is often lost or omitted, as you can see on the web.

In that situation, all a decoder can do is guess. This is why we still suffer from mojibake.
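
To make the guessing problem concrete, here is a minimal Python sketch. The sample string and the choice of Latin-1 as the wrong guess are my own assumptions for illustration; any mismatched pair of encodings behaves similarly.

```python
# The same bytes decode to different text depending on which encoding
# the decoder assumes. Nothing in the bytes themselves says which is right.
text = "文字化け"  # sample string (assumed for illustration)

raw = text.encode("shift_jis")  # the sender used Shift_JIS

# A decoder that guesses Latin-1 never fails, because every byte value
# is valid in Latin-1, but the result is garbage (mojibake).
wrong = raw.decode("latin-1")

# A decoder that happens to know the real encoding recovers the text.
right = raw.decode("shift_jis")

print(repr(wrong))
print(right)
```

Note that the wrong guess raises no error at all, which is why the damage often goes unnoticed until a human reads the text.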

A possible solution would have been to attach the encoding information to the head of the byte stream as a one- or two-byte tag.

For example:

UTF-8 = 0b00000001

UTF-16 = 0b00000010

Shift_JIS = 0b00000011

EUC-JP = 0b00000100

and so on.
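
A rough sketch of what such a tagged format could look like in Python. The names `ENCODING_TAGS`, `encode_tagged` and `decode_tagged` are hypothetical, the tag values just follow the numbering above (1 to 4), and this is not an existing protocol.

```python
# Hypothetical one-byte encoding tags, numbered 1..4 as in the list above.
ENCODING_TAGS = {
    "utf-8": 0b00000001,
    "utf-16": 0b00000010,
    "shift_jis": 0b00000011,
    "euc-jp": 0b00000100,
}
TAG_TO_ENCODING = {tag: name for name, tag in ENCODING_TAGS.items()}


def encode_tagged(text: str, encoding: str) -> bytes:
    """Encode text and prepend the one-byte tag for the encoding used."""
    tag = ENCODING_TAGS[encoding]
    return bytes([tag]) + text.encode(encoding)


def decode_tagged(data: bytes) -> str:
    """Read the leading tag byte and decode the rest accordingly."""
    if not data:
        raise ValueError("empty input: no encoding tag")
    tag, payload = data[0], data[1:]
    if tag not in TAG_TO_ENCODING:
        raise ValueError(f"unknown encoding tag: {tag:#04x}")
    return payload.decode(TAG_TO_ENCODING[tag])


# The decoder no longer has to guess: the first byte tells it what to do.
packed = encode_tagged("日本語", "shift_jis")
print(decode_tagged(packed))
```

The decoder only works on input produced by a matching encoder. Untagged legacy data would be misread, because its first byte would be taken as a tag.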

Of course, this is not a realistic solution, because everyone would have to switch their encoders and decoders to this protocol at once.