Hacker News
by jerf 4704 days ago
"Just make sure it's valid UTF-8 (or whatever encoding you're using) and escape it when you display it."

I've lately been coming around to the belief that anyone who uses the term "sanitize" in this domain, as in, "sanitize user input" really doesn't know what they are talking about (at least on average). The approach you describe is the generally correct approach; you need to ensure that the proper levels of escaping are being applied. Unfortunately this is nontrivial in practice, but it's still the correct solution.
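As a rough illustration of "escape it when you display it", here is a minimal Python sketch. The `render_comment` helper and the sample input are hypothetical; `html.escape` is the standard library function. The stored value is left untouched and only the HTML output is encoded.

```python
import html


def render_comment(text: str) -> str:
    """Encode user text for an HTML element body at output time (hypothetical helper)."""
    return "<p>" + html.escape(text) + "</p>"


# Sample input chosen for illustration; it is kept as-is in storage.
raw = '<script>alert("ha!")</script>'
rendered = render_comment(raw)
```

Nothing is removed from the input. Unescaping the rendered body gives back exactly what the user typed.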

The "sanitization" meme has resulted in me smacking down at least 3 commits from developers in my organization trying to "solve" XSS by scrubbing out all less-than characters across all input from the user, or eliminating all quotes, apostrophes, less-than signs, greater-than signs, backticks (for shell interpolation problems), etc. Unfortunately, the problem is that these are in general all perfectly valid input values, and some of them really smack you in the face immediately. (For instance, names may contain apostrophes. You can't "sanitize" them away; you need to write your SQL layer to handle that correctly, such as with binding.) You handle them by managing your encoding layers correctly, not by "sanitizing" them.
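A minimal sketch of what "binding" looks like, assuming Python's standard `sqlite3` module and a made-up `users` table. The sample names are hypothetical. The `?` placeholder hands the value to the database separately from the SQL text, so the apostrophe never needs to be stripped or hand-escaped.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT)")  # hypothetical schema

# Both values are stored verbatim; neither is treated as SQL.
names = ["O'Reilly", "'; DROP TABLE users --"]
for name in names:
    # The value is bound to the ? placeholder, never spliced into the statement.
    conn.execute("INSERT INTO users (name) VALUES (?)", (name,))

stored = [row[0] for row in conn.execute("SELECT name FROM users ORDER BY rowid")]
```

Both strings come back from the table unchanged, and the table still exists afterwards.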

(There are still some sanitization components in the resulting solution; I just don't think they are the way you should think about it. For instance, there are some characters that are flat-out forbidden in, say, an HTML attribute, and the right thing to do is just strip them out of any incoming string. But that should be thought of as a "sanitization" step being an important element of proper encoding, but not the actual "answer".)

3 comments

It's a shame there's such a proximity in terminology between 'sanitize' and 'sanity check'. I wonder if that's where this whole confusion began in the first place. Yes, it is extremely unlikely that a user's given name contains a <script> tag, but there are few reasons why your software should really care about it on a technical level - least of all if the way you choose to care about it leads to it also complaining when someone claims their name is O'Reilly. The correct response to someone claiming their name is "'; DROP TABLE Users --" should, ideally, be to say "Are you really sure about that?" but defer to the human decision on whether it's really the right thing to do.
Relevant XKCD - http://xkcd.com/327/
> I've lately been coming around to the belief that anyone who uses the term "sanitize" in this domain, as in, "sanitize user input" really doesn't know what they are talking about (at least on average).

I've had this view for a long while. I think there's a common sense to it that either clicks or it doesn't. Plus people hear/read "escape your inputs!" so often it becomes a cargo cult.

> You can't "sanitize" them away; you need to write your SQL layer to handle that correctly, such as with binding.) You handle them by managing your encoding layers correctly, not by "sanitizing" them.

Exactly. Whitelisting the values that can be stored in a field should be done to maintain the data integrity of the field. It's not an approach to solve security problems or prevent SQL injection.

> For instance, there are some characters that are flat-out forbidden in, say, an HTML attribute, and the right thing to do is just strip them out of any incoming string. But that should be thought of as a "sanitization" step being an important element of proper encoding, but not the actual "answer".)

We ran into something like this in our app as well. When displaying meta data for an object we create related objects in the dom and reference them by id. Originally the ids were generated by simply escaping the name of the raw object but that doesn't work because as you mention there are additional restrictions on what can be used in an "id" field. The solution? Hash it! Obviously that's a very specific solution as we only cared about it being unique and tied to the other object on the same page but it worked.
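A rough sketch of the hashing idea in Python. The comment does not say which hash or id format was used, so SHA-256, the `obj-` prefix, and the 16-character truncation are all assumptions made for illustration, as is the `dom_id` name.

```python
import hashlib


def dom_id(object_name: str) -> str:
    """Derive a stable DOM id from an arbitrary object name (hypothetical helper)."""
    digest = hashlib.sha256(object_name.encode("utf-8")).hexdigest()
    # The letter prefix keeps the id from starting with a digit; the hex digest
    # contains only 0-9 and a-f, whatever characters the name had.
    return "obj-" + digest[:16]


first = dom_id('My "weird" <object> name')
second = dom_id("another object")
```

The same name always maps to the same id, so related elements on the page can find each other without the raw name ever appearing in the attribute.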

If you're going to accept all characters by default, be prepared to sanitize the outputs for every use, not just your website.

Maybe you will output a data dump for someone else to print mailouts. Or you'll share the user database with a vendor's web forum. Or payment processing. Or any SaaS.

No. That completely doesn't work. This is really important: You CAN'T "sanitize" for every possible use. You can not correctly figure out in advance how to represent an input, because the different possibilities are numerous and actively self-contradictory.

To "sanitize" for "every possible use" is pretty much to remove everything that isn't an ASCII letter. Even unexpected spaces can cause crazy behavior. Commas can cause CSV-injections. And you might still have length problems even so. Oh, and you still can't guarantee something won't screw up even so! https://news.ycombinator.com/item?id=6140631
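For the comma case specifically, a minimal sketch of encoding at the CSV boundary instead of stripping characters, using Python's standard `csv` module. The sample row is made up. This only covers field delimiting; it does not address what a spreadsheet might later do with the cell contents.

```python
import csv
import io

# Hypothetical record whose first field contains a comma and double quotes.
row = ['Smith, John "Jack"', "jack@example.com"]

buf = io.StringIO()
csv.writer(buf).writerow(row)  # the writer quotes the field and doubles inner quotes
encoded = buf.getvalue()

# A conforming CSV reader recovers the original two fields.
decoded = next(csv.reader(io.StringIO(encoded)))
```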

You can not, at the time input comes in to a system, even pretend to know where all the data might end up, someday, given the whims of who knows whom, and who knows when. The only thing that works is for each system to correctly encode its output as needed, and if you output the correct thing and a subsequent system blows it up, it's the subsequent system's fault. You can't prevent it. You only think you can, but you're wrong.

To be clear, if you could defend against those systems messing up, I'd be willing to consider it. But you can't. It's impossible, both in theory and in practice.

There's no easy answer to writing secure code. (Though it would help a lot if people used type systems to better effect in this problem.) Filtering out certain "dirty" characters isn't an easy answer either, on the grounds that it isn't even an answer. (It turns out to often become not easy, too, because as you gradually and inevitably learn exactly how it isn't working for you, the subsequent frantically flailing addition of heuristics becomes very not easy itself. It is easier in the long run to do it correctly.)
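One way to read the type-system remark, sketched in Python under assumptions: the `Html` wrapper, `text`, and `paragraph` are hypothetical names, and the runtime `isinstance` check stands in for what a static type checker would catch earlier. The idea is that raw strings and already-encoded markup are different types, so forgetting to escape becomes a type error and not a silent bug.

```python
import html
from dataclasses import dataclass


@dataclass(frozen=True)
class Html:
    """Markup that has already been encoded for HTML output (hypothetical type)."""
    markup: str


def text(raw: str) -> Html:
    """The only way to turn a raw string into Html is to escape it."""
    return Html(html.escape(raw))


def paragraph(body: Html) -> Html:
    """Build a <p> element; refuses raw strings outright."""
    if not isinstance(body, Html):
        raise TypeError("paragraph() needs Html; wrap raw strings with text()")
    return Html("<p>" + body.markup + "</p>")


safe = paragraph(text("1 < 2"))
```

Calling `paragraph("1 < 2")` with a plain string raises `TypeError`, which is the point of the design.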

Perhaps I was unclear, but I did not claim that there could be one single sanitized version of the data, safe for all use cases. I was saying that you have to do different sanitization for every output.
That's not called 'sanitizing', it's called 'escaping' and 'encoding'.

The byte sequences I need to store to communicate the name "Kei$ha O'Shaughnessey, Jr." in a UTF-8 JSON string literal, a UTF-8 HTML attribute, a UTF-16 bigendian CSV file, or an ISO-8859 SQL parameter are going to be different - but so long as all the characters I need to pass are representable in all of those domains, all I have to do is perform the correct escaping and encoding. At no point do I need to 'sanitize' the name. It's a name, it's not dirty.
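A small Python sketch of that point, using only the standard library. The `title` attribute is a made-up context, and ISO-8859-1 is assumed for the SQL parameter since the comment only says "ISO-8859". The same name is escaped and encoded four ways, and each target gets different bytes.

```python
import csv
import html
import io
import json

name = "Kei$ha O'Shaughnessey, Jr."

# UTF-8 JSON string literal: JSON escaping, then UTF-8.
json_bytes = json.dumps(name).encode("utf-8")

# UTF-8 HTML attribute (hypothetical title attribute): HTML escaping, then UTF-8.
attr_bytes = ('title="' + html.escape(name, quote=True) + '"').encode("utf-8")

# UTF-16 big-endian CSV: CSV quoting, then UTF-16-BE.
buf = io.StringIO()
csv.writer(buf).writerow([name])
csv_bytes = buf.getvalue().encode("utf-16-be")

# Bound SQL parameter in ISO-8859-1 (assumed part): no escaping, just encoding.
sql_param_bytes = name.encode("iso-8859-1")
```

Each representation decodes back to the identical name, so nothing was lost and nothing was "cleaned".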

If there are characters there that I can't represent in the target domain, then I need to handle the loss of information.
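A minimal sketch of what "handle the loss of information" can mean in Python, assuming ISO-8859-1 as the target and a made-up name. The encoder refuses by default, and any lossy fallback has to be chosen explicitly.

```python
name = "Łukasz"  # hypothetical name; "Ł" (U+0141) is not in ISO-8859-1

# Default behaviour: fail loudly so the caller has to decide what to do.
try:
    name.encode("iso-8859-1")
    representable = True
except UnicodeEncodeError:
    representable = False

# Explicit, lossy fallbacks the caller can opt into.
replaced = name.encode("iso-8859-1", errors="replace")
as_char_ref = name.encode("iso-8859-1", errors="xmlcharrefreplace")
```

Which fallback is acceptable depends on the target: a numeric character reference only makes sense if the output is HTML or XML.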

A strategy of 'escaping' assumes that the partner system does the right thing with its data. This is not always the case.

For instance, it may be perfectly fine in my system to have a user named '<script>alert("ha!")</script>'. Are you sure that's okay in your PHP-based web forum? Really sure? Every place they've ever shown a username to the user, it's well-escaped?

And even if that's true today, what about the day when someone decides to change the web forum software to something else? What about the day when someone turns on a feature that copies certain forum threads to an internal support system, also provided by a third party?

Somewhat related example anecdote: For several years, Vimeo was sending me newsletter emails addressed to "Dear Jarek_Piórkowski" (previously "Hi Jarek Pi??rkowski"). The ó that should be there shows up fine on the Vimeo website and I even cleared and re-input the name into my profile to give them a chance to re-encode it. Still continued.

I unsubscribed from the newsletter eventually.

And ó isn't even a difficult character, it's in ISO 8859-1 for crying out loud.

Perfect example. That indicates that at some point, your data passed through a system using Windows-1252 encoding.

http://www.i18nqa.com/debug/utf8-debug.html
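The Windows-1252 diagnosis can be checked with a short Python sketch: UTF-8 bytes misread as Windows-1252 turn "ó" into two characters, and reversing the two steps recovers the original. Where in the pipeline this happened is not knowable from the output alone.

```python
original = "Piórkowski"

# A system writes UTF-8 bytes; a later one reads them as Windows-1252.
garbled = original.encode("utf-8").decode("cp1252")

# Undoing the misinterpretation restores the name, as long as no bytes were lost.
repaired = garbled.encode("cp1252").decode("utf-8")
```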

I expect Vimeo used a Linux system to collect your data, and I bet the thing that blasts emails out is ultimately Linux as well. So the Windows-1252 bungle probably happened in a third system in between, maybe a Windows system chosen for its ease of administration by the community managers.

Not that this is relevant to data sanitization (they're just being fuckups here) but it shows how complex this can get.