|
|
|
|
|
by tialaramex
2139 days ago
|
|
I mean, are you though? Was it "data" ? The supposed data doesn't have metadata to tell you what it means (which encoding / character set was used), so, did it actually mean anything? Smashing the bits that don't mean anything to U+FFFD leaves humans with the unmistakable evidence that something was lost here. It's not like U+FFFD doesn't scream "Hey stuff went wrong here" - it's an inverse question mark on a diamond, short of an animated GIF that says "Uh-oh" with an anvil dropping onto a cartoon character's head we can't do much more. If you're sure it's supposed to be ISO-8859-1 then sure, treating it was UTF-8 eats data, likewise if it was supposed to be KOI-8 or something. But you don't know, so, if "Give up and demand a human fix things" isn't a sensible option, which it often won't be, this is the best we can do. |
|
Yes. The fact that you couldn't interpret it doesn't mean that the consumer of your output couldn't have if you'd passed it through without going out of your way to destroy it.
There is a large, enormous class of software that cares detects specific sequences meaningful to it that exist in ASCII, and copies other parts of it input directly into its output without really caring to modify it. If you don't intentionally destroy your input, this will silently Just Work with many encodings in actual use.