by yetanotherjosh
3572 days ago
I mostly agree, but there are cases where some definition of sanitization is the only appropriate thing: for example, if you allow users to create content with a lightweight subset of HTML for the sake of formatting control and want to render that HTML in your page. In such cases, the correct way to sanitize it is not via regexps but via a DOM parser that takes the user input, builds a DOM, and then emits rendered HTML according to a whitelist of allowed tags and attributes. You might argue that DOM parsing isn't sanitization and so still matches your assertion; however, it's common and not really inaccurate to call this sanitization.
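To make the parser-plus-whitelist idea concrete, here is a rough sketch using only Python's standard `html.parser`. The `ALLOWED` table, the `WhitelistSanitizer` class, and the `sanitize` function are hypothetical names made up for illustration. Note that this is an event-based parser rather than a full DOM builder, and it does not balance unclosed or stray tags, so treat it as a demonstration of the principle and not as something to deploy.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical whitelist: tag name -> set of attribute names permitted on it.
ALLOWED = {
    "b": set(), "i": set(), "em": set(), "strong": set(), "p": set(),
    "a": {"href"},
}


class WhitelistSanitizer(HTMLParser):
    """Re-emit only whitelisted tags and attributes; escape all text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag not in ALLOWED:
            return  # the tag is dropped; any text inside it is still emitted
        kept = [(k, v) for k, v in attrs if k in ALLOWED[tag] and v is not None]
        if tag == "a":
            # Assumption for this sketch: only http(s) links are acceptable.
            kept = [(k, v) for k, v in kept
                    if v.lower().startswith(("http://", "https://"))]
        rendered = "".join(f' {k}="{escape(v, quote=True)}"' for k, v in kept)
        self.out.append(f"<{tag}{rendered}>")

    def handle_endtag(self, tag):
        if tag in ALLOWED:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        # Text is always re-encoded, never passed through raw.
        self.out.append(escape(data, quote=False))


def sanitize(fragment):
    parser = WhitelistSanitizer()
    parser.feed(fragment)
    parser.close()
    return "".join(parser.out)


print(sanitize('<b onclick="x()">hi</b><blink>yo</blink>'))
```

The point is that output is built only from things the whitelist explicitly permits, instead of trying to pattern-match and strip the bad parts out of the input string.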
|
The important thing is to not change information. "Sanitization" as it is commonly used means doing something that (potentially) changes information. That is in contrast to decoding/encoding/parsing/unparsing/translation/..., which, if done correctly, change representation, but not information.
So, to make it a useful distinction, I would call anything that potentially changes the semantics of the processed data "sanitization", and avoid using the term for anything else.
So, simply parsing a string with an HTML parser, possibly checking for acceptable elements, and then serializing back into some sort of canonical form that is semantically equivalent to the input, that's perfectly fine, and I wouldn't call that sanitization, but rather validation and canonicalization.
If you simply start dropping elements, though, that's probably a bad idea, just as simply dropping "<" characters is a bad idea, because those elements presumably bear some semantic meaning, just as a "<" in a message presumably bears some semantic meaning.
Now, it is not always obvious at which level of abstraction to evaluate the semantics (and thus the preservation of semantics). So, it might be perfectly fine, for example, to remove or replace some elements where the semantics are known and you can show that, say, removing emphasis still generally preserves the meaning of a text.
But a whitelist approach where you simply remove everything that isn't on the whitelist usually is a bad idea. If you want to have a whitelist, use it for validation, and reject anything that's not acceptable, so the user can transform their input in such a way as to avoid any constructs you don't want, while still retaining the meaning of what they are trying to say.
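As a rough illustration of using the whitelist for validation instead of removal, here is a sketch with Python's standard `html.parser`. The names `ALLOWED_TAGS`, `RejectedInput`, `WhitelistValidator`, and `validate` are hypothetical, and the sketch assumes a policy where no attributes at all are permitted. It only checks elements and attributes (not comments or declarations) and leaves out the canonicalization step.

```python
from html.parser import HTMLParser

# Hypothetical whitelist for this sketch; no attributes are permitted at all.
ALLOWED_TAGS = {"b", "i", "em", "strong", "p"}


class RejectedInput(ValueError):
    """Raised when the input uses constructs outside the whitelist."""


class WhitelistValidator(HTMLParser):
    """Collect every construct that is not on the whitelist."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.problems = []

    def handle_starttag(self, tag, attrs):
        if tag not in ALLOWED_TAGS:
            self.problems.append(f"element <{tag}> is not allowed")
        for name, _value in attrs:
            self.problems.append(f"attribute {name!r} on <{tag}> is not allowed")


def validate(fragment):
    """Return the input unchanged if acceptable, otherwise raise RejectedInput."""
    validator = WhitelistValidator()
    validator.feed(fragment)
    validator.close()
    if validator.problems:
        # Nothing is dropped or rewritten: the user is told what to change.
        raise RejectedInput("; ".join(validator.problems))
    return fragment


try:
    validate("<p>hi</p><marquee>x</marquee>")
except RejectedInput as err:
    print("rejected:", err)
```

Because the function either returns the input as given or refuses it with a list of reasons, the decision about how to reword the content stays with the user, which is what preserves the meaning.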