by jerf 4704 days ago
No. That completely doesn't work. This is really important: You CAN'T "sanitize" for every possible use. You can not correctly figure out in advance how to represent an input, because the different possibilities are numerous and actively self-contradictory.

To "sanitize" for "every possible use" is pretty much to remove everything that isn't an ASCII letter. Even unexpected spaces can cause crazy behavior. Commas can cause CSV-injections. And you might still have length problems even so. Oh, and you still can't guarantee something won't screw up even so! https://news.ycombinator.com/item?id=6140631

You can not, at the time input comes in to a system, even pretend to know where all the data might end up, someday, given the whims of who knows whom, and who knows when. The only thing that works is for each system to correctly encode its output as needed, and if you output the correct thing and a subsequent system blows it up, it's the subsequent system's fault. You can't prevent it. You only think you can, but you're wrong.

To be clear, if you could defend against those systems messing up, I'd be willing to consider it. But you can't. It's impossible, both in theory and in practice.

There's no easy answer to writing secure code. (Though it would help a lot if people used type systems to better effect in this problem.) Filtering out certain "dirty" characters isn't an easy answer either, on the grounds that it isn't even an answer. (It often turns out not to be easy, too, because as you gradually and inevitably learn exactly how it isn't working for you, the subsequent frantic, flailing addition of heuristics becomes very not easy itself. It is easier in the long run to do it correctly.)
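
As a rough illustration of the type-system point, here is a minimal Python sketch. The names `Html`, `escape_text`, and `join_html` are made up for this example and are not from any particular framework. The idea is that plain `str` always means "raw text", and only values wrapped in `Html` are treated as markup, so forgetting to escape becomes hard to do by accident.

```python
import html
from dataclasses import dataclass


@dataclass(frozen=True)
class Html:
    """Hypothetical wrapper: a string that is already valid HTML markup."""
    markup: str


def escape_text(text: str) -> Html:
    # Raw text becomes markup only by passing through the escaper.
    return Html(html.escape(text, quote=True))


def join_html(*parts: "str | Html") -> Html:
    # Plain str parts are assumed to be raw text and are escaped here;
    # Html parts are trusted as markup and passed through unchanged.
    pieces = []
    for part in parts:
        if isinstance(part, Html):
            pieces.append(part.markup)
        else:
            pieces.append(escape_text(part).markup)
    return Html("".join(pieces))


greeting = join_html(Html("<p>Hello, "), "Kei$ha O'Shaughnessey <admin>", Html("</p>"))
print(greeting.markup)
```

Nothing is filtered out of the name; it is only encoded for the one place it is being written to.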

1 comments

Perhaps I was unclear, but I did not claim that there could be one single sanitized version of the data, safe for all use cases. I was saying that you have to do different sanitization for every output.
That's not called 'sanitizing', it's called 'escaping' and 'encoding'.

The byte sequences I need to store to communicate the name "Kei$ha O'Shaughnessey, Jr." in a UTF-8 JSON string literal, a UTF-8 HTML attribute, a UTF-16 big-endian CSV file, or an ISO-8859 SQL parameter are going to be different, but so long as all the characters I need to pass are representable in all of those domains, all I have to do is perform the correct escaping and encoding. At no point do I need to 'sanitize' the name. It's a name, it's not dirty.
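
To make that concrete, here is a small sketch using only the Python standard library. A few things here are assumptions for the sake of illustration: the `value="..."` attribute, the extra CSV column, the `users` table, and the use of in-memory SQLite as a stand-in for whatever SQL database is actually involved (the driver handles the wire encoding, so the ISO-8859 part is not shown).

```python
import csv
import html
import io
import json
import sqlite3

name = "Kei$ha O'Shaughnessey, Jr."

# UTF-8 JSON string literal: json.dumps adds the quotes and any escapes.
json_bytes = json.dumps(name, ensure_ascii=False).encode("utf-8")

# UTF-8 HTML attribute: escape &, <, > and both quote characters.
html_attr = 'value="' + html.escape(name, quote=True) + '"'
html_bytes = html_attr.encode("utf-8")

# UTF-16 big-endian CSV: the csv module quotes the field because of the comma.
buf = io.StringIO(newline="")
csv.writer(buf).writerow([name, "42"])
csv_bytes = buf.getvalue().encode("utf-16-be")

# SQL parameter: the value is bound, never spliced into the statement text.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT)")
conn.execute("INSERT INTO users (name) VALUES (?)", (name,))
stored = conn.execute("SELECT name FROM users").fetchone()[0]

print(json_bytes)
print(html_bytes)
print(csv_bytes)
print(stored)
```

Four different byte sequences come out, and the same name comes back from each of them when decoded by the matching reader.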

If there are characters there that I can't represent in the target domain, then I need to handle the loss of information.
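
One way to handle that case, sketched in Python. The function name `encode_for_target` is hypothetical, and ISO-8859-1 is picked here only as an example target. The point is to make the loss an explicit decision instead of something that happens silently.

```python
def encode_for_target(text: str, encoding: str) -> bytes:
    """Encode text for a target charset, refusing to lose characters silently."""
    try:
        # The default error mode is "strict": unrepresentable characters raise.
        return text.encode(encoding)
    except UnicodeEncodeError as exc:
        bad = text[exc.start:exc.end]
        raise ValueError(f"cannot represent {bad!r} in {encoding}") from exc


# Representable in the target: encodes cleanly.
ok = encode_for_target("Zoë", "iso-8859-1")

# Not representable: the caller has to decide what to do about it.
try:
    encode_for_target("名前", "iso-8859-1")
    failed = False
except ValueError:
    failed = True

# If lossy output is acceptable, say so explicitly at the call site.
lossy = "名前".encode("iso-8859-1", errors="replace")
```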

A strategy of 'escaping' assumes that the partner system does the right thing with its data. This is not always the case.

For instance, it may be perfectly fine in my system to have a user named '<script>alert("ha!")</script>'. Are you sure that's okay in your PHP-based web forum? Really sure? Every place they've ever shown a username to the user, it's well-escaped?
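
A short Python sketch of what "well-escaped" has to mean at each of those places. The two render functions are hypothetical stand-ins for a forum template, not code from any real forum software.

```python
import html

username = '<script>alert("ha!")</script>'


def render_heading_unescaped(name: str) -> str:
    # What a careless template does: splice the raw value into markup.
    return "<h1>" + name + "</h1>"


def render_heading_escaped(name: str) -> str:
    # What every output site has to do: encode for the HTML text context.
    return "<h1>" + html.escape(name) + "</h1>"


unsafe = render_heading_unescaped(username)
safe = render_heading_escaped(username)
print(unsafe)
print(safe)
```

The stored username is identical in both cases; only the output step differs, and a single place that uses the first form is enough to cause trouble.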

And even if that's true today, what about the day when someone decides to change the web forum software to something else? What about the day when someone turns on a feature that copies certain forum threads to an internal support system, also provided by a third party?