| In our app we neither validate nor escape user strings on input for any free-form text (e.g. "names" and descriptions)[1]. We only validate the max length. If text is truly free-form then you don't need to validate or whitelist anything. Just make sure it's valid UTF-8 (or whatever encoding you're using) and escape it when you display it. Combine that with prepared statements with bind variables (aka named parameters) and you don't have any issues with user input. One other benefit of this approach is that you end up with proper i18n support without doing anything special. From your app's perspective all text is the same. If users want to use Unicode characters or put HTML tags in their descriptions then let them. If you escape it then there's no XSS issue. Plus it's WYSIWYG[2] from a user's perspective. Who am I to judge that a user putting "<script>alert('Haxors!');</script>" as the name of an object is a bad idea? [1]: "Names" don't include usernames, which generally should have a whitelisted character set (ex: ASCII [a-z][a-z0-9+]), or email addresses (use a real validator ... not a regex!). [2]: https://en.wikipedia.org/wiki/Wysiwyg |
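For concreteness, the store-raw, escape-on-display idea might look roughly like the sketch below. It assumes Python and the standard-library `html` module; the names `accept_free_text` and `render_description` and the 200-character limit are hypothetical, made up for illustration rather than taken from the app being described.

```python
import html

MAX_LEN = 200  # hypothetical limit; use whatever your schema allows


def accept_free_text(raw: bytes, max_len: int = MAX_LEN) -> str:
    """Accept free-form text: check the encoding and the length, nothing else."""
    text = raw.decode("utf-8")  # raises UnicodeDecodeError on invalid UTF-8
    if len(text) > max_len:
        raise ValueError("text too long")
    return text  # stored exactly as the user typed it


def render_description(text: str) -> str:
    """Escape at display time, here for HTML element content."""
    return "<p>" + html.escape(text) + "</p>"
```

With this, a "name" of `<script>alert('Haxors!');</script>` is stored unchanged and shows up on the page as literal text instead of running.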
I've lately been coming around to the belief that anyone who uses the term "sanitize" in this domain, as in "sanitize user input", really doesn't know what they are talking about (at least on average). The approach you describe is the generally correct approach; you need to ensure that the proper levels of escaping are being applied. Unfortunately this is nontrivial in practice, but it's still the correct solution.
The "sanitization" meme has resulted in me smacking down at least 3 commits from developers in my organization trying to "solve" XSS by scrubbing out all less-than characters across all input from the user, or eliminating all quotes, apostrophes, less-than signs, greater-than signs, backticks (for shell interpolation problems), etc. Unfortunately, the problem is that these are in general all perfectly valid input values, and some of them really smack you in the face immediately. (For instance, names may contain apostrophes. You can't "sanitize" them away; you need to write your SQL layer to handle that correctly, such as with binding.) You handle them by managing your encoding layers correctly, not by "sanitizing" them.
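As a rough illustration of "managing your encoding layers", here is a sketch assuming Python's standard-library `sqlite3` and `subprocess` modules. The table name, the sample name, and the hostile string are invented for the example; the point is only that each layer gets the value through its own safe channel, with nothing stripped.

```python
import sqlite3
import subprocess
import sys

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT)")

name = "Miles O'Brien"  # perfectly valid input; nothing to scrub

# SQL layer: the ? placeholder keeps the value out of the SQL text,
# so the apostrophe needs no manual quoting.
conn.execute("INSERT INTO people (name) VALUES (?)", (name,))
row = conn.execute(
    "SELECT name FROM people WHERE name = ?", (name,)
).fetchone()

# Shell layer: passing an argument list (no shell=True) means no shell
# ever parses the string, so backticks and $(...) stay literal.
hostile = "`whoami` $(id)"
result = subprocess.run(
    [sys.executable, "-c", "import sys; print(sys.argv[1])", hostile],
    capture_output=True,
    text=True,
)
```

Here `row` comes back holding the name with its apostrophe intact, and the child process simply echoes the hostile string back as text.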
(There are still some sanitization components in the resulting solution; I just don't think they are the way you should think about it. For instance, there are some characters that are flat-out forbidden in, say, an HTML attribute, and the right thing to do is just strip them out of any incoming string. But that should be thought of as a "sanitization" step that is an important element of proper encoding, not as the actual "answer".)
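A sketch of what that could look like for a double-quoted HTML attribute, again assuming Python's standard library. The helper name `attr_value` is hypothetical, and the set of stripped characters (C0 control characters other than tab, LF, FF and CR, plus DEL) is an assumption for illustration; the exact forbidden set depends on the context and spec you are targeting.

```python
import html
import re

# Assumption: characters treated as unrepresentable in an attribute value.
_FORBIDDEN = re.compile(r"[\x00-\x08\x0b\x0e-\x1f\x7f]")


def attr_value(text: str) -> str:
    """Encode text for use inside a double-quoted HTML attribute."""
    cleaned = _FORBIDDEN.sub("", text)       # the small "sanitization" step
    return html.escape(cleaned, quote=True)  # the actual encoding step


tag = '<input name="title" value="{}">'.format(
    attr_value('He said "hi"\x00 & left')
)
```

The stripping only removes what the target context cannot represent at all; quotes, ampersands and angle brackets survive as escaped entities rather than being deleted.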