Hacker News
by MajesticHobo 3290 days ago
You've said nothing that contradicts my post.

As long as the data is sanitized before it can affect the storage/transport mechanism for its content type, you're good.

> As long as the data is sanitized before it can affect the storage/transport mechanism for its content type, you're good.

No, not really. Storing the user's data exactly as submitted is almost always of paramount importance. Because it may be output as HTML/XML/Markdown/whatever, output time is when you must sanitize/escape/quote.

That's why the moral of the Bobby Tables story isn't: "Oh, just remove all semicolons". It's "use prepared queries".
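For anyone who wants the "prepared queries" point spelled out, here is a minimal sketch using Python's built-in sqlite3 module. The `students` table and the `add_student` helper are made up for illustration; the point is only that the value travels separately from the SQL text, so nothing has to be stripped out of it.

```python
import sqlite3


def add_student(conn: sqlite3.Connection, name: str) -> None:
    # The "?" placeholder binds the value as data; it is never parsed as SQL.
    conn.execute("INSERT INTO students (name) VALUES (?)", (name,))


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (name TEXT)")  # hypothetical schema

bobby = "Robert'); DROP TABLE students;--"
add_student(conn, bobby)

# The name comes back byte-for-byte, semicolons and all, and the table survives.
stored = conn.execute("SELECT name FROM students").fetchone()[0]
row_count = conn.execute("SELECT COUNT(*) FROM students").fetchone()[0]
```

Nothing was removed from the input, and nothing needed to be.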

I don't disagree with sanitizing data at output time when it's clear that A) the input won't affect anything else and B) output is going to happen. But realize not all input winds up in a SQL database, not all input will be considered valid in all contexts, and not all input eventually becomes output.

Sometimes, data really does need to be sanitized at the point of submission. If you disagree, that's more of a point about application design than appsec.

> But realize not all input winds up in a SQL database, not all input will be considered valid in all contexts, and not all input eventually becomes output.

That was the point I was trying to make: Sanitizing input is fail-from-the-start. There's no way to know ahead of time what outputs you're going to be producing 5 years from now. Conclusion: Store all input exactly as received. (We can do that these days with form/url encodings and whatnot).

Ok, so now you have the data stored accurately.

Next step: You need to output to, let's say, HTML. Ok, so you just escape/quote everything appropriately and nobody gets hurt. If you just do the escaping/quoting properly, there are no XSS attacks. It's really just that simple.
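A rough sketch of what "escape at output time" could look like, using Python's standard `html.escape`. The `render_comment` helper is hypothetical, and it only covers one context (text inside an HTML element); attribute values, URLs and inline scripts each need their own encoding.

```python
import html


def render_comment(raw: str) -> str:
    # Escape at the moment of output, for the HTML element-content context only.
    return "<p>" + html.escape(raw, quote=True) + "</p>"


# The stored value is kept exactly as the user typed it.
stored = '<script>alert("xss")</script>'
page = render_comment(stored)
```

The stored string is untouched, so the same record can still be rendered correctly into some other format later.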

However, it is NOT about sanitizing at the "input" point. Do you get what I'm saying now?

(I realize that that sounds aggressive, but I really just want to force this point home. Please tell me if you disagree or find some detail in my explanation confusing. This is important for the security of the web and either I'm wrong or you're wrong or I didn't understand what you said. Let's figure out which is the case.)

[1] There are caveats here.

You're all saying the same thing.

I didn't specify whether the sanitizing occurred on receiving user input or displaying it.

I only said, sanitize all user input.

> I didn't specify whether the sanitizing occurred on receiving user input or displaying it.

I'm sorry, but you basically did. You said:

> You must sanitize ALL user input even if you don't think you're going to render it on a web page

Which implies that sanitizing input at display time, when you know you're rendering it to a web page, is too late. That's why people are jumping on you. Keeping a clean database is the absolute most important thing you can do. The database isn't contextual. The data it stores can find its way into HTML pages, REST responses, SQL queries, PDF reports, XML/JSON data exports and a ton of other formats. Each of these output formats will require a different form of sanitizing. Sanitizing before the data hits disk creates a nightmare for anyone displaying the data in any context other than the one that sanitization was meant for. So what you said originally is precisely incorrect. Only sanitize input when you know it's going to be rendered to a webpage. Otherwise, leave it alone.
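To illustrate why one up-front sanitization can't serve every format, here is a small sketch that takes a single stored value and encodes it for four different outputs with Python's standard library. The sample string is invented; the function calls are the stock stdlib ones.

```python
import html
import json
from urllib.parse import quote, unquote
from xml.sax.saxutils import escape as xml_escape

# One pristine value, exactly as a user might have submitted it.
stored = 'Tom & "Jerry" <tj@example.com>'

as_html = html.escape(stored)           # HTML text or quoted attribute value
as_json = json.dumps({"name": stored})  # JSON document
as_url = quote(stored, safe="")         # URL path or query component
as_xml = xml_escape(stored)             # XML text node (quotes left alone)

# Each encoding is reversible back to the same stored value...
assert html.unescape(as_html) == stored
assert json.loads(as_json)["name"] == stored
assert unquote(as_url) == stored

# ...but the encodings differ from one another, so no single
# pre-storage "sanitized" form would be right for all of them.
assert len({as_html, as_json, as_url, as_xml}) == 4
```

If the database had stored the HTML-escaped form, the JSON export and the URL would both carry stray `&amp;` entities.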

Now, you should be using view-layer frameworks to make that sanitization easy, automatic and the default action. When rendering to HTML, the templating language should sanitize by default and give template authors a way to opt out when they know the data did not come from user input. Likewise, in the SQL context, prepared statements also make it easy for the developer to do the right thing. But at no point are you speculatively sanitizing all user input. You're getting user input to disk in as pristine a format as possible and sanitizing contextually depending on how the data is output.
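As a toy illustration of "sanitize by default, opt out explicitly", here is a sketch in plain Python. The `Trusted` class and `render` function are hypothetical and far simpler than a real template engine; they only show the shape of the idea. The template string itself is assumed to be written by the developer, never by a user.

```python
import html


class Trusted(str):
    """Hypothetical marker for markup written by the template author."""


def render(template: str, **values: object) -> str:
    # Every value is escaped unless it was explicitly wrapped in Trusted.
    prepared = {
        key: value if isinstance(value, Trusted) else html.escape(str(value))
        for key, value in values.items()
    }
    return template.format(**prepared)


out = render(
    "<li>{icon} {name}</li>",
    icon=Trusted("<b>*</b>"),             # author-supplied markup: opt out
    name="<img src=x onerror=alert(1)>",  # user input: escaped by default
)
```

Forgetting to do anything special gives you the safe behavior; getting raw markup through takes a deliberate step.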