by tialaramex 1883 days ago
No. Don't do any of these things. The reason U+FFFD exists, even though ASCII has any number of fun things you can scribble in one byte, is that U+FFFD specifically isn't any of the things your program is unprepared to see turn up after unrelated processing.

It isn't a letter, or a digit, or whitespace, or punctuation, or a word separator, or a control character, it is neither uppercase nor lowercase, it doesn't have any canonical equivalents - it's just a codepoint that exists specifically for this purpose.
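To make that list concrete, here is a minimal sketch that checks most of those properties with Python's standard `unicodedata` module. The helper name `describe` is made up for illustration, and the "word separator" property is not checked because the standard library does not expose it directly.

```python
import unicodedata

REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER


def describe(ch: str) -> dict:
    """Collect the properties listed above for one character (hypothetical helper)."""
    category = unicodedata.category(ch)
    return {
        "category": category,
        "letter": ch.isalpha(),
        "digit": ch.isdigit(),
        "whitespace": ch.isspace(),
        "punctuation": category.startswith("P"),
        "control": category == "Cc",
        "uppercase": ch.isupper(),
        "lowercase": ch.islower(),
        # An empty decomposition string means there is no canonical
        # or compatibility mapping for this character.
        "has_decomposition": unicodedata.decomposition(ch) != "",
        "changed_by_normalization": any(
            unicodedata.normalize(form, ch) != ch
            for form in ("NFC", "NFD", "NFKC", "NFKD")
        ),
    }


props = describe(REPLACEMENT)
for name, value in props.items():
    print(f"{name}: {value}")
```

Every boolean should come back `False`, and the general category is "So" (Symbol, other).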

As a result, if gibberish sneaks into your system somehow and gets turned into U+FFFD, it's much less likely to break something important elsewhere.
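A small sketch of that, assuming the gibberish is a stray Latin-1 byte in text that is supposed to be UTF-8. Python's built-in `errors="replace"` handler stands in for whatever decoder your system uses, and `decode_lossy` is a made-up name.

```python
def decode_lossy(raw: bytes) -> str:
    """Decode as UTF-8, turning each undecodable sequence into U+FFFD."""
    return raw.decode("utf-8", errors="replace")


# 0xE9 is E-acute in Latin-1, but on its own it is not valid UTF-8.
damaged = b"caf\xe9!"
text = decode_lossy(damaged)

# Later processing sees one ordinary token: U+FFFD is not whitespace,
# so it does not split the word, and case mapping leaves it untouched.
tokens = text.split()
shouted = text.upper()
```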

And when sooner or later a human is shown this text, it's very obvious that U+FFFD isn't what they expected, whether that was E-acute, a Euro currency symbol, a cat emoji or whatever else, and the human will know something went wrong and can decide if they care about that.

1 comment

This is about what can be done in SIMD. See GP and GGP. I think using ASCII SUB is a perfectly reasonable thing to do in this context. You can always further post-process the result to turn ASCII SUB into U+FFFD.
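A rough sketch of that post-processing step, assuming an earlier stage (the SIMD pass, not shown here) has already overwritten every invalid byte with ASCII SUB (0x1A), so the buffer is valid UTF-8. The function name is hypothetical.

```python
SUB = b"\x1a"                                # ASCII SUB, one byte
REPLACEMENT_UTF8 = "\ufffd".encode("utf-8")  # U+FFFD is three bytes in UTF-8


def sub_to_replacement(validated: bytes) -> str:
    """Turn every SUB byte into U+FFFD, then decode.

    Assumes `validated` is already valid UTF-8. A SUB byte that was
    present in the original input is indistinguishable from an inserted
    one and gets replaced as well.
    """
    return validated.replace(SUB, REPLACEMENT_UTF8).decode("utf-8")


result = sub_to_replacement(b"caf\x1a!")
```

Note the output grows by two bytes per replacement, which is part of why the one-byte SUB is convenient inside the SIMD pass itself.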