by sehrope 4704 days ago
In our app we neither validate nor escape user strings for any free form text (eg. "names" and descriptions)[1]. We only validate the max length.

If text is truly free form then you don't need to validate or white list anything. Just make sure it's valid UTF-8 (or whatever encoding you're using) and escape it when you display it. That combined with using prepared statements with bind variables (aka named parameters) and you don't have any issues with user inputs.
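A minimal sketch of that combination, assuming Python with the standard-library `sqlite3` and `html` modules. The `objects` table, the `MAX_NAME_LENGTH` limit, and the helper names are made up for illustration and are not from the app described above:

```python
import html
import sqlite3

MAX_NAME_LENGTH = 200  # hypothetical limit; the only validation applied


def store_name(conn, name):
    """Store free-form text unchanged; only length and encoding are checked."""
    if len(name) > MAX_NAME_LENGTH:
        raise ValueError("name too long")
    # Raises UnicodeEncodeError if the string cannot be encoded as UTF-8
    # (for example, a lone surrogate).
    name.encode("utf-8")
    # The value is bound as a parameter, never spliced into the SQL text.
    conn.execute("INSERT INTO objects (name) VALUES (?)", (name,))


def render_names(conn):
    """Escape at display time, for the HTML context specifically."""
    rows = conn.execute("SELECT name FROM objects ORDER BY id").fetchall()
    return ["<li>" + html.escape(name) + "</li>" for (name,) in rows]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE objects (id INTEGER PRIMARY KEY, name TEXT)")
store_name(conn, "O'Reilly")
store_name(conn, "<script>alert('Haxors!');</script>")
store_name(conn, "'; DROP TABLE objects --")
rendered = render_names(conn)
```

The stored values are byte-for-byte what the user typed; only the rendered HTML differs.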

One other benefit of this approach is that you end up with proper i18n support without doing anything special. From your app's perspective all text is the same. If users want to use Unicode characters or put HTML tags in their descriptions then let them. If you escape it then there's no XSS issue. Plus it's WYSIWYG[2] from a user's perspective.

Who am I to judge that a user putting "<script>alert('Haxors!');</script>" as the name of an object is a bad idea?

[1]: "Names" don't include usernames which generally should have a whitelisted character set (ex: ASCII [a-z][a-z0-9+]) or email addresses (use a real validator ... not a regex!).

[2]: https://en.wikipedia.org/wiki/Wysiwyg

4 comments

"Just make sure it's valid UTF-8 (or whatever encoding you're using) and escape it when you display it."

I've lately been coming around to the belief that anyone who uses the term "sanitize" in this domain, as in, "sanitize user input" really doesn't know what they are talking about (at least on average). The approach you describe is the generally correct approach; you need to ensure that the proper levels of escaping are being applied. Unfortunately this is nontrivial in practice, but it's still the correct solution.

The "sanitization" meme has resulted in me smacking down at least 3 commits from developers in my organization trying to "solve" XSS by scrubbing out all less than characters across all input from the user, or eliminating all quotes, apostrophes, less than, greater than, backticks (for shell interpolation problems), etc etc. Unfortunately, the problem is, these are in general all perfectly valid input values, and some of them really smack you in the face immediately. (For instance, names may contain apostrophes. You can't "sanitize" them away; you need to write your SQL layer to handle that correctly, such as with binding.) You handle them by managing your encoding layers correctly, not by "sanitizing" them.

(There are still some sanitization components in the resulting solution, I just don't think they are the way you should think about it. For instance, there are some characters that are flat-out forbidden in, say, an HTML attribute, and the right thing to do is just strip them out of any incoming string. But that should be thought of as a "sanitization" step being an important element of proper encoding, but not the actual "answer".)

It's a shame there's such a proximity in terminology between 'sanitize' and 'sanity check'. I wonder if that's where this whole confusion began in the first place. Yes, it is extremely unlikely that a user's given name contains a <script> tag, but there are few reasons why your software should really care about it on a technical level - least of all if the way you choose to care about it leads to it also complaining when someone claims their name is O'Reilly. The correct response to someone claiming their name is "'; DROP TABLE Users --" should, ideally, be to say "Are you really sure about that?" but defer to the human decision on whether it's really the right thing to do.
Relevant XKCD - http://xkcd.com/327/
> I've lately been coming around to the belief that anyone who uses the term "sanitize" in this domain, as in, "sanitize user input" really doesn't know what they are talking about (at least on average).

I've had this view for a long while. I think there's a common sense to it that either clicks or it doesn't. Plus people hear/read "escape your inputs!" so often it becomes a cargo cult.

> You can't "sanitize" them away; you need to write your SQL layer to handle that correctly, such as with binding.) You handle them by managing your encoding layers correctly, not by "sanitizing" them.

Exactly. Whitelisting the values that can be stored in a field should be done to maintain the data integrity of the field. It's not an approach to solve security problems or prevent SQL injection.

> For instance, there are some characters that are flat-out forbidden in, say, an HTML attribute, and the right thing to do is just strip them out of any incoming string. But that should be thought of as a "sanitization" step being an important element of proper encoding, but not the actual "answer".)

We ran into something like this in our app as well. When displaying meta data for an object we create related objects in the dom and reference them by id. Originally the ids were generated by simply escaping the name of the raw object but that doesn't work because as you mention there are additional restrictions on what can be used in an "id" field. The solution? Hash it! Obviously that's a very specific solution as we only cared about it being unique and tied to the other object on the same page but it worked.
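Roughly what the hashing approach looks like, as a sketch in Python. The `meta-` prefix, the choice of SHA-256, and the truncation to 16 hex digits are assumptions for illustration, not what our app actually uses:

```python
import hashlib


def dom_id_for(name, prefix="meta-"):
    """Derive a DOM-safe id from an arbitrary object name.

    The result contains only the prefix plus lowercase hex digits, so it is
    valid as an HTML id no matter what characters the name contains.
    """
    digest = hashlib.sha256(name.encode("utf-8")).hexdigest()
    # Truncated for readability; a shorter id raises the (still small)
    # chance of two names on the same page colliding.
    return prefix + digest[:16]


id_a = dom_id_for("<script>alert('Haxors!');</script>")
id_b = dom_id_for("My object / with spaces & symbols")
```

The id is stable for a given name, which is all that is needed to link two elements on the same page. It is not reversible, so keep the name itself wherever you need to display it.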

If you're going to accept all characters by default, be prepared to sanitize the outputs for every use, not just your website.

Maybe you will output a data dump for someone else to print mailouts. Or you'll share the user database with a vendor's web forum. Or payment processing. Or any SaaS.

No. That completely doesn't work. This is really important: You CAN'T "sanitize" for every possible use. You can not correctly figure out in advance how to represent an input, because the different possibilities are numerous and actively self-contradictory.

To "sanitize" for "every possible use" is pretty much to remove everything that isn't an ASCII letter. Even unexpected spaces can cause crazy behavior. Commas can cause CSV-injections. And you might still have length problems even so. Oh, and you still can't guarantee something won't screw up even so! https://news.ycombinator.com/item?id=6140631

You can not, at the time input comes in to a system, even pretend to know where all the data might end up, someday, given the whims of who knows whom, and who knows when. The only thing that works is for each system to correctly encode its output as needed, and if you output the correct thing and a subsequent system blows it up, it's the subsequent system's fault. You can't prevent it. You only think you can, but you're wrong.

To be clear, if you could defend against those systems messing up, I'd be willing to consider it. But you can't. It's impossible, both in theory and in practice.

There's no easy answer to writing secure code. (Though it would help a lot if people used type systems to better effect in this problem.) Filtering out certain "dirty" characters isn't an easy answer either, on the grounds that it isn't even an answer. (It turns out to often become not easy, too, because as you gradually and inevitably learn exactly how it isn't working for you, the subsequent frantically flailing addition of heuristics becomes very not easy itself. It is easier in the long run to do it correctly.)
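One way to read the type-system remark, sketched in Python: give already-escaped markup its own type so raw strings cannot be mixed into HTML by accident. The `Html`, `text`, and `tag` names are hypothetical, and since Python does not enforce annotations at runtime, the sketch adds an explicit check (a static checker such as mypy would flag the same mistake earlier):

```python
import html
from dataclasses import dataclass


@dataclass(frozen=True)
class Html:
    """Markup that is already safe to emit into an HTML body."""
    markup: str


def text(raw: str) -> Html:
    """The only way to turn an untrusted string into Html: escape it."""
    return Html(html.escape(raw))


def tag(name: str, *children: Html) -> Html:
    """Wrap children in an element; refuses anything that is not Html."""
    for child in children:
        if not isinstance(child, Html):
            raise TypeError("expected Html, got raw " + type(child).__name__)
    inner = "".join(child.markup for child in children)
    return Html("<" + name + ">" + inner + "</" + name + ">")


paragraph = tag("p", text("<script>alert('Haxors!');</script>"))
```

Passing a plain `str` to `tag` fails immediately, so forgetting to escape becomes a type error rather than an XSS hole. The element name itself is trusted here and must come from the program, not from user input.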

Perhaps I was unclear, but I did not claim that there could be one single sanitized version of the data, safe for all use cases. I was saying that you have to do different sanitization for every output.
That's not called 'sanitizing', it's called 'escaping' and 'encoding'.

The byte sequence I need to store to communicate the name "Kei$ha O'Shaughnessey, Jr." in a UTF-8 JSON string literal, a UTF-8 HTML attribute, a UTF-16 bigendian CSV file, or an ISO-8859 SQL parameter, are going to be different - but so long as all the characters I need to pass are representable in all of those domains all I have to do is perform the correct escaping and encoding. At no point do I need to 'sanitize' the name. It's a name, it's not dirty.

If there are characters there that I can't represent in the target domain, then I need to handle the loss of information.
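To make that concrete, here is a sketch in Python using only the standard library. SQLite stands in for "an SQL parameter" and the `<input>` element for "an HTML attribute"; both are assumptions for the example:

```python
import csv
import html
import io
import json
import sqlite3

name = "Kei$ha O'Shaughnessey, Jr."

# JSON string literal, UTF-8: json.dumps adds the quotes and any escapes.
json_bytes = json.dumps(name, ensure_ascii=False).encode("utf-8")

# HTML attribute, UTF-8: quote=True also escapes ' and ".
html_bytes = ('<input value="' + html.escape(name, quote=True) + '">').encode("utf-8")

# CSV, UTF-16 big-endian: the writer quotes the field because of the comma.
buf = io.StringIO(newline="")
csv.writer(buf).writerow([name, "customer"])
csv_bytes = buf.getvalue().encode("utf-16-be")

# SQL parameter: the driver handles the value; no escaping by hand.
conn = sqlite3.connect(":memory:")
(sql_roundtrip,) = conn.execute("SELECT ?", (name,)).fetchone()

# A character the target encoding cannot represent is a separate problem:
# it has to be detected and handled, not silently dropped.
try:
    "直".encode("iso-8859-1")
    lossy = False
except UnicodeEncodeError:
    lossy = True
```

Four different byte sequences, one unchanged name. Decoding each representation with the matching parser gives back the original string.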

A strategy of 'escaping' assumes that the partner system does the right thing with its data. This is not always the case.

For instance, it may be perfectly fine in my system to have a user named '<script>alert("ha!")</script>'. Are you sure that's okay in your PHP-based web forum? Really sure? Every place they've ever shown a username to the user, it's well-escaped?

And even if that's true today, what about the day when someone decides to change the web forum software to something else? What about the day when someone turns on a feature that copies certain forum threads to an internal support system, also provided by a third party?

Somewhat related example anecdote: For several years, Vimeo was sending me newsletter emails addressed to "Dear Jarek_PiÃ³rkowski" (previously "Hi Jarek Pi??rkowski"). The ó that should be there shows up fine on the Vimeo website and I even cleared and re-input the name into my profile to give them a chance to re-encode it. Still continued.

I unsubscribed from the newsletter eventually.

And ó isn't even a difficult character, it's in ISO 8859-1 for crying out loud.

Perfect example. That indicates that at some point, your data passed through a system using Windows-1252 encoding.

http://www.i18nqa.com/debug/utf8-debug.html
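The failure mode is easy to reproduce. A sketch in Python showing one plausible path, not a claim about what Vimeo's pipeline actually did: UTF-8 bytes get decoded as Windows-1252, and a later ASCII-only step replaces what it cannot represent:

```python
name = "Piórkowski"

# Correct storage: ó becomes the two bytes 0xC3 0xB3 in UTF-8.
utf8_bytes = name.encode("utf-8")

# The bungle: some system reads those bytes as Windows-1252,
# turning one character into two.
mojibake = utf8_bytes.decode("cp1252")

# A later ASCII-only step replaces each unrepresentable character with "?".
ascii_only = mojibake.encode("ascii", errors="replace").decode("ascii")

# While the data is only mis-decoded (not yet replaced), reversing the
# wrong step recovers the original.
repaired = mojibake.encode("cp1252").decode("utf-8")
```

Two question marks for a single letter is the giveaway: the letter was two bytes, and each byte was treated as its own character. Once the bytes have been replaced with "?" the information is gone and cannot be repaired.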

I expect Vimeo used a Linux system to collect your data, and I bet the thing that blasts emails out is ultimately Linux as well. So the Windows-1252 bungle probably happened in a third system in between, maybe a Windows system chosen for its ease of administration by the community managers.

Not that this is relevant to data sanitization (they're just being fuckups here) but it shows how complex this can get.

Just to be a bit pedantic, unfortunately you don't get "proper i18n support" just by putting everything in UTF-8.

Unicode lets you represent lots of abstract characters, from different languages and societies, in one character set. That doesn't quite tell you how to render the characters. For that, you need to know what language the text is in. Unicode wants you to provide that information out-of-band, e.g. in an HTML "lang" attribute, which the renderer can use to paint the proper glyphs.

For example, the Arabic digits 4 through 7 (۴ U+06F4 .. ۷ U+06F7) have different glyphs in Persian, Sindhi, and Urdu. And a character like 直 (U+76F4) has Chinese and Japanese glyphs that may not be mutually recognizable.

Bottom line: if you want an internationalized system that can store and render multilingual text, storing the text in Unicode is a good start, but you will need to store additional info (like the language) to be able to properly render the text.
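A small sketch of what storing the language alongside the text could look like, in Python. The `TaggedText` type and `render_span` helper are hypothetical; the `lang` values are meant to be BCP 47 language tags, which is what the HTML `lang` attribute expects:

```python
import html
from dataclasses import dataclass


@dataclass(frozen=True)
class TaggedText:
    """User text plus the language it was written in."""
    text: str
    lang: str  # BCP 47 tag such as "fa", "ur", "ja", or "zh-Hans"


def render_span(item: TaggedText) -> str:
    """Emit the text with its language so the renderer can pick glyphs."""
    return '<span lang="{}">{}</span>'.format(
        html.escape(item.lang, quote=True),
        html.escape(item.text),
    )


japanese = render_span(TaggedText("直", "ja"))
chinese = render_span(TaggedText("直", "zh-Hans"))
```

Both spans carry the same code point, U+76F4. Whether they actually look different depends on the browser and the fonts installed.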

I found http://en.wikipedia.org/wiki/Eastern_Arabic_numerals which shows examples of the differences in those numerals, but it looks like the different representations have different Unicode code points. So, there's no need for the lang attribute. (The page uses them, but if you take them off there's no difference in the display.)

You probably need to know the language to do things like sorting, comparison, regex, etc. But if you're just storing and displaying user-entered strings and your software has no need to understand the meaning of the strings, I think it's enough to do what the parent says.

Not quite. The Wikipedia article shows the difference between U+0660 .. U+0669 (Arabic-Indic digits) on the top row and U+06F0 .. U+06F9 (Eastern Arabic-Indic digits) on the bottom row.

But what I'm talking about are the different glyphs used to represent the bottom row (U+06F0 .. U+06F9) depending on whether the text is in Persian, Sindhi, or Urdu. See http://www.unicode.org/versions/Unicode6.2.0/ch08.pdf, table 8-2.
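The two rows can be told apart programmatically, which is a quick way to check which one a given string uses. A sketch with Python's `unicodedata` module; note that the Unicode character names call the second row "Extended" Arabic-Indic digits:

```python
import unicodedata

# Digit four from each row of the Wikipedia table.
arabic_indic_four = "\u0664"
extended_four = "\u06f4"

for ch in (arabic_indic_four, extended_four):
    print("U+{:04X}".format(ord(ch)), unicodedata.name(ch), unicodedata.digit(ch))

same_value = unicodedata.digit(arabic_indic_four) == unicodedata.digit(extended_four)
same_code_point = arabic_indic_four == extended_four
```

Same numeric value, different code points. What this cannot show is the Persian, Sindhi, or Urdu glyph variant of U+06F4: that distinction is not in the character data at all, which is why the language has to come from somewhere else.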

There is also the issue I mentioned about Chinese vs. Japanese glyphs for the same coded character, which is at least as important in practice.

This is an issue with CJK characters and probably just one more reason why UTF-8 adoption has been slow where JIS is good enough.
Regarding [1], in the favour of regexps: http://en.wikipedia.org/wiki/Regular_language

If you can't use a regexp to recognize the general case of email addresses, no finite automaton can.

Yes, but there is a point at which it's better to just hand-write some code which is equivalent to the automaton, rather than trying to use a regexp.
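As an illustration of the hand-written style, here is a sketch in Python. It is deliberately a simplified structural check and not an implementation of the RFC 822 or RFC 5322 address grammar: it rejects some valid addresses (quoted local parts, for instance) and accepts some that will never deliver. The function name and the exact rules are assumptions for the example:

```python
def looks_like_email(address: str) -> bool:
    """Conservative structural check; NOT a full RFC address parser."""
    # Split on the last "@" so the domain is whatever follows it.
    local, sep, domain = address.rpartition("@")
    if not sep or not local or not domain:
        return False
    if len(local) > 64 or len(address) > 254:
        return False
    if any(ch.isspace() for ch in address):
        return False
    labels = domain.split(".")
    if len(labels) < 2:
        return False
    for label in labels:
        if not label or len(label) > 63:
            return False
        if label.startswith("-") or label.endswith("-"):
            return False
        if not all(ch.isalnum() or ch == "-" for ch in label):
            return False
    return True


checks = {
    "user@example.com": looks_like_email("user@example.com"),
    "o'reilly@example.com": looks_like_email("o'reilly@example.com"),
    "no-at-sign": looks_like_email("no-at-sign"),
    "a@b": looks_like_email("a@b"),
    "a@-bad.example": looks_like_email("a@-bad.example"),
}
```

Each rule is a separate line that can be read, tested, and changed on its own, which is the advantage over a single enormous regexp. The only real proof that an address works is still sending mail to it.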

This is what a proper email-validation regexp looks like: http://www.ex-parrot.com/pdw/Mail-RFC822-Address.html

> If you escape it then there's no XSS issue.

Not XSS, but you need to be careful about allowing through things like the LTR/RTL override characters.
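A sketch of how those could be detected, in Python. The set below covers the two override characters (U+202D and U+202E) plus, as an assumption about what else is worth flagging, the related embedding and isolate controls. Whether to reject, strip, or just visually mark such text is a policy choice the sketch leaves open:

```python
# Bidirectional formatting controls: embeddings and overrides
# (U+202A..U+202E) and isolates (U+2066..U+2069).
BIDI_CONTROLS = {
    "\u202a", "\u202b", "\u202c", "\u202d", "\u202e",
    "\u2066", "\u2067", "\u2068", "\u2069",
}


def find_bidi_controls(text):
    """Return (index, code point label) for each bidi control in text."""
    return [
        (i, "U+{:04X}".format(ord(ch)))
        for i, ch in enumerate(text)
        if ch in BIDI_CONTROLS
    ]


def strip_bidi_controls(text):
    """Remove bidi controls; legitimate RTL text itself is left untouched."""
    return "".join(ch for ch in text if ch not in BIDI_CONTROLS)


# A right-to-left override can make the tail of a string display reversed.
suspicious = "evil\u202etxt.exe"
found = find_bidi_controls(suspicious)
cleaned = strip_bidi_controls(suspicious)
```

Ordinary Arabic or Hebrew text does not need these characters to render correctly, so flagging them does not penalize right-to-left languages.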