by naniwaduni 2137 days ago
In a world where you design your entire data processing pipeline from the ground up for each process, sure: you, the person who knows the encoding, and the program itself have the same knowledge of text encodings. You also wouldn't have this problem in the first place.

In practice, if this comes up at all, destroying your data is a huge mistake. Curse the person who wrote the tool in the middle that eats data, and don't be them.

Even outright throwing errors is better than replacing characters and expecting whoever looks at the data on the other end to notice.
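To make the two behaviours concrete, here is a minimal Python sketch, assuming the input is supposed to be UTF-8. The function names `decode_strict` and `decode_lossy` and the sample bytes are made up for illustration; only `bytes.decode` and its `errors` argument are standard.

```python
def decode_strict(raw: bytes) -> str:
    # Raises UnicodeDecodeError at the first invalid byte sequence,
    # so the failure is visible where it happens.
    return raw.decode("utf-8", errors="strict")


def decode_lossy(raw: bytes) -> str:
    # Substitutes U+FFFD for each invalid sequence and carries on;
    # the original bytes cannot be recovered from the result.
    return raw.decode("utf-8", errors="replace")


# Hypothetical input: "café" encoded as Latin-1, which is not valid UTF-8.
sample = b"caf\xe9"
```

With this input, `decode_strict(sample)` raises, while `decode_lossy(sample)` quietly returns a string ending in U+FFFD, and it is up to someone downstream to notice.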

1 comment

If you want an error, throw an error. We specified that up front. The entire sub-thread you're in is about what happens if you, for whatever reason, can't throw an error or don't want to.

Notice that, again, if you want an error you can detect U+FFFD and error out on that. I mean, apparently this isn't Unicode after all, right? So the only way U+FFFD got into the pipeline is through an error you've now decided you should have caught but... didn't?
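As a rough sketch of that check in Python: the names `ReplacementCharacterError` and `reject_replacement` are hypothetical, and the whole thing assumes the source data never legitimately contains U+FFFD, which is the premise above.

```python
REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER


class ReplacementCharacterError(ValueError):
    """Hypothetical error type: an upstream decode step lost data."""


def reject_replacement(text: str) -> str:
    # If U+FFFD is present, some earlier stage hit undecodable input
    # and substituted it; turn that into a hard error here.
    index = text.find(REPLACEMENT)
    if index != -1:
        raise ReplacementCharacterError(f"U+FFFD found at index {index}")
    return text
```

A later stage can call `reject_replacement(text)` before doing anything else, so a tool that could not throw at decode time still gets its error, just one step further down.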

Your approach randomly introduces unspecified behaviour, which is likely to introduce security vulnerabilities and who knows what other problems, because it resists "Full Recognition Before Processing".
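For anyone unfamiliar with the term, this is roughly what it looks like in code: accept or reject the whole input first, and only then act on it. The sketch below is an illustration under assumptions, not anyone's real tool. The `key=value` line format and the names `recognize` and `process` are invented for the example.

```python
def recognize(raw: bytes) -> list[str]:
    # Recognition step: the whole input must be valid UTF-8 and every
    # line must match the (hypothetical) key=value format.
    text = raw.decode("utf-8")  # strict by default, raises on bad bytes
    lines = text.splitlines()
    for number, line in enumerate(lines, start=1):
        if "=" not in line:
            raise ValueError(f"line {number} is not key=value")
    return lines


def process(raw: bytes) -> dict[str, str]:
    # Nothing below runs unless the entire input was accepted above,
    # so no partially understood data is ever acted on.
    records = recognize(raw)
    return dict(line.split("=", 1) for line in records)
```

The point of the split is that `process` never sees input the recognizer did not fully understand, which is exactly what a tool forwarding bytes it cannot interpret gives up.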

Unlike treating text in an unknown encoding as UTF-8, passing it through mangled by tools that didn't actually understand it, as you've proposed, does lead to real-world vulnerabilities that can be as serious as remote code execution.