by tialaramex 2139 days ago

I mean, are you though? Was it "data" ? The supposed data doesn't have metadata to tell you what it means (which encoding / character set was used), so, did it actually mean anything?

Smashing the bits that don't mean anything to U+FFFD leaves humans with the unmistakable evidence that something was lost here. It's not like U+FFFD doesn't scream "Hey stuff went wrong here" - it's an inverse question mark on a diamond, short of an animated GIF that says "Uh-oh" with an anvil dropping onto a cartoon character's head we can't do much more.

If you're sure it's supposed to be ISO-8859-1 then sure, treating it as UTF-8 eats data, likewise if it was supposed to be KOI-8 or something. But you don't know, so, if "Give up and demand a human fix things" isn't a sensible option, which it often won't be, this is the best we can do.
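For concreteness, a minimal Python sketch of the trade-off described above. The sample word and the variable names are illustrative assumptions, not anything from a real pipeline.

```python
# Bytes for "café" in ISO-8859-1; the receiver is not told the encoding.
raw = "café".encode("iso-8859-1")

# Lossy decode as UTF-8: the byte that is not valid UTF-8 becomes U+FFFD.
lossy = raw.decode("utf-8", errors="replace")

# What a reader who knew the real encoding would have recovered.
intended = raw.decode("iso-8859-1")

# The replacement is visible to a human, but the original byte is gone:
# re-encoding the lossy text does not give back the input.
round_trips = lossy.encode("utf-8") == raw

print(ascii(lossy), ascii(intended), round_trips)
```

The U+FFFD marks where something was lost, which is the point being made here, and the failed round trip is the data loss the other side of the argument objects to.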


naniwaduni 2138 days ago

> Was it "data" ?

Yes. The fact that you couldn't interpret it doesn't mean that the consumer of your output couldn't have if you'd passed it through without going out of your way to destroy it.

There is an enormous class of software that detects specific sequences meaningful to it that exist in ASCII, and copies the other parts of its input directly into its output without modifying them. If you don't intentionally destroy your input, this will silently Just Work with many encodings in actual use.
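A rough Python sketch of that kind of tool, working on bytes rather than decoded text. The function name `rewrite_keys` and the `key: value` line format are made up for illustration; the assumption is an ASCII-compatible encoding (ISO-8859-x, KOI8-R, UTF-8 and the like), so it says nothing about UTF-16 or similar.

```python
def rewrite_keys(data: bytes) -> bytes:
    """Upper-case the ASCII key in each `key: value` line; copy the rest verbatim."""
    out = []
    for line in data.split(b"\n"):
        # Only the ASCII bytes for newline and colon are given any meaning.
        key, sep, value = line.partition(b":")
        # bytes.upper() changes ASCII lowercase letters only; other bytes pass through.
        out.append(key.upper() + sep + value)
    return b"\n".join(out)


latin1_in = "name: café".encode("iso-8859-1")
koi8_in = "name: привет".encode("koi8-r")

latin1_out = rewrite_keys(latin1_in)
koi8_out = rewrite_keys(koi8_in)

# The tool never knew either encoding, yet a consumer who does can still decode.
print(latin1_out.decode("iso-8859-1"))
print(koi8_out.decode("koi8-r"))
```

Nothing in the tool decodes the values, so there is nothing for it to replace or reject.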


tialaramex 2138 days ago

But this involves a sleight of hand where, at first, you deny knowing the encoding so as to require we can't decode it, and then, you declare you did know the encoding so as to declare the results "destroyed" because now you can't decode them.

Just tell the text processing tool the encoding. Or, don't use text.

If you resent having to pick an encoding, it still works - just the encoding is UTF-8 because duh, of course it is.


naniwaduni 2138 days ago

In a world where you design your entire data processing pipeline from the ground up for each process, sure, you the person who knows the encoding of the program, and the program itself, have the same knowledge of text encodings. You also wouldn't have this problem in the first place.

In practice, if this comes up at all it's a huge mistake to be destroying your data. Curse the person who wrote the tool in the middle that eats data, and don't be them.

Even outright throwing errors is better than replacing characters and expecting whoever looks at the data on the other end to notice.


tialaramex 2138 days ago

If you want an error, throw an error. We specified that up front. The entire sub-thread you're in is about what happens if you for whatever reason can't throw an error or don't want to.

Notice that, again, if you want an error, you can detect U+FFFD and error out on that. I mean, apparently this isn't Unicode after all, right? So the only way U+FFFD got into the pipeline is because of an error you've now decided you should have caught but... didn't?
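Both options are a few lines in Python. This is only a sketch: the helper names `decode_strict` and `reject_replacement` are hypothetical, and the downstream check assumes, as argued above, that U+FFFD can only have arrived via an earlier lossy decode.

```python
REPLACEMENT = "\ufffd"


def decode_strict(raw: bytes) -> str:
    """Throw up front: the default errors='strict' raises UnicodeDecodeError."""
    return raw.decode("utf-8")


def reject_replacement(text: str) -> str:
    """Throw later: refuse text in which an upstream tool already substituted U+FFFD."""
    if REPLACEMENT in text:
        raise ValueError(f"U+FFFD found at index {text.index(REPLACEMENT)}")
    return text


bad = b"caf\xe9"  # ISO-8859-1 bytes, not valid UTF-8

try:
    decode_strict(bad)
except UnicodeDecodeError as exc:
    print("strict decode failed:", exc.reason)

try:
    reject_replacement(bad.decode("utf-8", errors="replace"))
except ValueError as exc:
    print("downstream check failed:", exc)
```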

Your approach randomly introduces unspecified behaviour which is likely to introduce security vulnerabilities and who knows what other problems because it resists "Full Recognition Before Processing".

Unlike treating text in an unknown encoding as UTF-8, passing it through, mangled by tools that didn't actually understand it, as you've proposed does lead to real-world vulnerabilities that can be as serious as remote code execution.
