by ynik 2178 days ago
The bytestring was truncated after 32 bytes, in the middle of a UTF-8 byte sequence. This means the resulting truncated string is not valid UTF-8 anymore. So my guess is that most devices decide "if it's not valid UTF-8, it must be $LEGACY_ENCODING".
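To make the failure concrete, here is a minimal sketch of a 32-byte cut landing inside a multi-byte character. The sample string and the helper names (is_valid_utf8, truncate_utf8) are made up for illustration; they are not what any of the devices actually run.

```python
# Hypothetical sample: 31 ASCII bytes followed by one 3-byte UTF-8 character.
text = "a" * 31 + "\u20ac"       # U+20AC (euro sign) encodes as E2 82 AC
data = text.encode("utf-8")      # 34 bytes in total
truncated = data[:32]            # the cut lands right after the E2 lead byte


def is_valid_utf8(b: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        b.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def truncate_utf8(b: bytes, limit: int) -> bytes:
    """Cut to at most `limit` bytes without splitting a character.

    Assumes `b` is valid UTF-8 to begin with, so the only thing
    errors="ignore" can drop is the incomplete sequence at the end.
    """
    return b[:limit].decode("utf-8", errors="ignore").encode("utf-8")


print(is_valid_utf8(data))                     # full string is fine
print(is_valid_utf8(truncated))                # naive cut is not
print(is_valid_utf8(truncate_utf8(data, 32)))  # boundary-aware cut is
```

A sender that truncates on a character boundary like this loses the last character but never hands the receiver invalid UTF-8.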

Unicode offers two ways forward when you can't decode what you have. One alternative is an exception: you just fail because you weren't able to decode something.

The other is that, for any code unit that won't decode, you emit U+FFFD, the Unicode Replacement Character, and then you carry on decoding.

For humans, U+FFFD makes it obvious something is wrong; it's typically visualised as a black diamond with a white question mark. And for a machine it shouldn't match parsing rules: it isn't an alphanumeric, and it isn't any of the common separator or spacing characters, so it's unlikely to be of use in an attack.
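Both options can be seen with Python's built-in decoder, as a rough sketch. The input bytes here are a made-up example of a sequence cut off after its lead byte.

```python
# Hypothetical input: "abc" followed by the lone lead byte of a 3-byte sequence.
bad = b"abc\xe2"

# Option 1: fail. errors="strict" is the default for bytes.decode().
try:
    bad.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# Option 2: emit U+FFFD for what won't decode and carry on.
replaced = bad.decode("utf-8", errors="replace")

marker = "\ufffd"
print(strict_failed)        # True
print(replaced == "abc" + marker)
print(marker.isalnum())     # not alphanumeric
print(marker.isspace())     # not whitespace
```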

That is a reasonable approach if you know that what you are decoding is supposed to be UTF-8.

If you don't know the text encoding, because there is no information to indicate it (or you don't trust that information to be correct), then you will have to guess. "Decode as UTF-8 for valid UTF-8, use some legacy encoding otherwise" is a common approach (used e.g. by many text editors).
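A minimal sketch of that guessing heuristic follows. The function name decode_guess and the choice of cp1252 as the legacy encoding are assumptions for illustration; real editors and devices pick their fallback in their own ways.

```python
from typing import Tuple


def decode_guess(data: bytes, legacy: str = "cp1252") -> Tuple[str, str]:
    """Decode as UTF-8 if valid, otherwise fall back to a legacy encoding.

    Returns the decoded text and the name of the encoding that was used.
    The cp1252 default is an assumption, not a universal rule.
    """
    try:
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # errors="replace" covers bytes the legacy codec leaves undefined.
        return data.decode(legacy, errors="replace"), legacy


whole = "caf\u00e9".encode("utf-8")   # b"caf\xc3\xa9", valid UTF-8
cut = whole[:4]                       # b"caf\xc3", cut mid-character

print(decode_guess(whole))
print(decode_guess(cut))
```

Here a single missing byte flips the whole string to the legacy interpretation, which matches the behaviour guessed at in the parent comment.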

I cannot believe I did not notice that. I will rerun all of my testing with a valid UTF-8 byte sequence :)