Hacker News
by tialaramex 2178 days ago
Unicode offers two ways forward when you can't decode what you have. One alternative is an exception: you just fail because you weren't able to decode something.

The other is that, for any code unit that won't decode, you emit U+FFFD, the Unicode Replacement Character, and then carry on decoding.
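
To make the two options concrete, here is a minimal Python sketch. The sample bytes and the helper names `decode_strict` and `decode_replace` are made up for illustration; `bytes.decode` and its `errors` argument are standard Python.

```python
# Hypothetical sample: 0xE9 is "é" in Latin-1, but here it is an invalid
# UTF-8 sequence because no continuation bytes follow it.
SAMPLE = b"caf\xe9 ok"


def decode_strict(raw: bytes) -> str:
    """Option one: fail with an exception on the first undecodable byte."""
    return raw.decode("utf-8", errors="strict")


def decode_replace(raw: bytes) -> str:
    """Option two: emit U+FFFD for what won't decode, then carry on."""
    return raw.decode("utf-8", errors="replace")


if __name__ == "__main__":
    try:
        decode_strict(SAMPLE)
    except UnicodeDecodeError as exc:
        print("strict decoding failed:", exc)

    # The bad byte becomes U+FFFD and the rest of the text survives.
    print(repr(decode_replace(SAMPLE)))  # 'caf\ufffd ok'
```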

For humans, U+FFFD makes it obvious something is wrong: it's typically visualised as a black diamond with a white question mark. For a machine, it shouldn't match parsing rules: it isn't an alphanumeric and it isn't any of the common separator or spacing characters, so it's unlikely to be of use in an attack.
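
A rough way to check the "shouldn't match parsing rules" claim in Python. The `COMMON_SEPARATORS` set and the `looks_inert` helper are assumptions chosen for this sketch, not a standard definition of what a parser treats as significant.

```python
import unicodedata

REPLACEMENT = "\ufffd"

# Hypothetical set of separators a typical parser might care about.
COMMON_SEPARATORS = set(",;:|/\\\"'<>&=")


def looks_inert(ch: str) -> bool:
    """True if ch is not alphanumeric, not whitespace, not a common separator."""
    return not ch.isalnum() and not ch.isspace() and ch not in COMMON_SEPARATORS


if __name__ == "__main__":
    print(unicodedata.name(REPLACEMENT))      # REPLACEMENT CHARACTER
    print(unicodedata.category(REPLACEMENT))  # So (Symbol, other)
    print(looks_inert(REPLACEMENT))           # True
```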

1 comment

That is a reasonable approach if you know that what you are decoding is supposed to be UTF-8.

If you don't know the text encoding, because there is no information to indicate it (or you don't trust that information to be correct), then you will have to guess, and "decode as UTF-8 for valid UTF-8, use some legacy encoding otherwise" is a common approach (used e.g. by many text editors).
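
A minimal sketch of that guess in Python. The function name `decode_guess` and the choice of Windows-1252 as the legacy fallback are assumptions for illustration; real editors pick the fallback from locale, heuristics, or user settings.

```python
from typing import Tuple


def decode_guess(raw: bytes, fallback: str = "cp1252") -> Tuple[str, str]:
    """Return (text, encoding_used): UTF-8 if the bytes are valid UTF-8,
    otherwise the assumed legacy encoding."""
    try:
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # cp1252 leaves a few byte values undefined, so substitute U+FFFD
        # for those rather than failing a second time.
        return raw.decode(fallback, errors="replace"), fallback


if __name__ == "__main__":
    print(decode_guess("café".encode("utf-8")))  # ('café', 'utf-8')
    print(decode_guess(b"caf\xe9"))              # ('café', 'cp1252')
```

Note that this can guess wrong: text in some other legacy encoding will still decode "successfully" as cp1252, just to the wrong characters.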