| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by yetanotherjosh 3572 days ago
	I mostly agree but there are cases where some definition of sanitization is the only appropriate thing. For example, if you allow users to create content with a lightweight subset of HTML for the sake of formatting control and want to render that html in your page. And in such cases, the correct way to sanitize it is not via regexps but via a DOM parser that takes user input and builds a DOM and then emits rendered html according to a whitelist of available tags/attributes. So you might argue DOM parsing isn't sanitization and so still matches your assertion, however, in general it's common and not really inaccurate to call this sanitization.

1 comments

zAy0LfpBZLC8mAC 3572 days ago

Well, it depends ... ;-)

The important thing is to not change information. "Sanitization" as it is commonly used means doing something that (potentially) changes information. Which is in contrast to decoding/encoding/parsing/unparsing/translation/..., which, if done correctly, change representation, but not information.

So, to make it a useful distinction, I would call anything that potentially changes the semantics of the processed data "sanitization", and avoid using the term for anything else.

So, simply parsing a string with an HTML parser, possibly checking for acceptable elements, and then serializing back into some sort of canonical form that is semantically equivalent to the input, that's perfectly fine, and I wouldn't call that sanitization, but rather validation and canonicalization.

If you simply start dropping elements, though, that's probably a bad idea, just as simply dropping "<" characters is a bad idea, because those elements presumably bear some semantic meaning, just as a "<" in a message presumably bears some semantic meaning.

Now, it is not always obvious which level of abstraction to evaluate the semantics (and thus the preservation of semantics) at. So, it might be prefectly fine, for example, to remove or replace some elements where the semantics are known and you can show that, say, removing emphasis still generally preserves the meaning of a text.

But a whitelist approach where you simply remove everything that isn't on the whitelist usually is a bad idea. If you want to have a whitelist, use it for validation, and reject anything that's not acceptable, so the user can transform their input in such a way as to avoid any constructs you don't want, while still retaining the meaning of what they are trying to say.

link

yetanotherjosh 3572 days ago

I hear what you're saying and it represents an ideal. But there are circumstances where information really has to be removed. Perhaps because the user is no longer present and it was collected under circumstances that had more liberal validation. Or because you're handing information across a boundary of implementation ownership and can't trust the receiver to handle potentially dangerous information correctly. I agree that sanitization (in the sense of stripping bits out of a user data payload according to some security rules) shouldn't be the first tool in the toolbox, but I would really hesitate to say it's always the wrong thing to do.

Edit: here's a good example. I don't know if they still do this, but when I worked at Yahoo!, they used a modified version of PHP that applied a comprehensive sanitization process to all user inputs. As a frontend coder at Y!, all the information you pulled from request parameters, headers, etc, ran through this validation at the PHP level before your app code got to it. You can then literally splat this information into an html page raw, without any further treatment, and not expose the Y! property you work for to an XSS or other injection vectors. There were ways to obtain the raw input using explicit accessors when needed, and these workarounds were detectable by code monitoring tools and had to be reviewed and approved by security team(s). Overall this worked really quite well, in my opinion. Y! could hire junior frontend devs without deep knowledge of data encoding, security issues, etc etc and rest easy. I think the principle of safe-by-default, even if it means destruction of user input in some cases via aggressive sanitization, is a good principle to apply to a frontend framework.

link

zAy0LfpBZLC8mAC 3572 days ago

re edit:

No, that's just a terrible idea. It might work quite well in the sense that it prevents server security holes. But it makes for terrible usability, and potentially even security problems for the user. The user expects that their input is reproduced correctly, and if it isn't, that can potentially have catastrophic consequences because it might result in silent corruption.

If you think that it should be possible to have developers who don't understand this problem and its solution, you might as well argue that it should be possible to have developers who don't know any arithmetic or who are illiterate.

If you want to have some help from the computer in avoiding injection attacks, the solution is a type system that ensures that you cannot accidentally use user input as HTML or SQL, for example, or possibly automates coercion when you need to insert pieces of one language into another.

link