Hacker News
by billpg 4456 days ago
Put yourself in the shoes of an inexperienced programmer building their first website. You've been advised to sanitize your inputs with the example of Bobby Tables.

You know the plain English meaning of "Sanitize". Clearly, you need to remove those single quote characters as they are unsanitary?

3 comments

Put yourself in the shoes of an inexperienced mathematician proving their first theorem. You've been advised to take care with infinity, with the example of two parallel lines meeting at infinity.

The problem: your theorem deals with a discrete, countable infinity...

As a side note, the English meaning of "sanitize" is "make clean and hygienic", nothing more. It says nothing about "removing". Other definitions are extensions based on CONTEXT, once again.

... which is exactly what you should not do. Here, let me post this:

"If you want to create a horizontal line in HTML, you write <hr>"

See that? There is nothing "unclean" about it, hence you should not "clean" it. You just have to encode it if you output it embedded in HTML. That's why calling it "sanitizing" is misleading.
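To make that concrete, here is a minimal sketch in Python (my choice of language, nothing in this thread depends on it) using the standard library's `html.escape`. The variable names are just for illustration. The point is that the text is encoded for the HTML context and can be decoded back unchanged, so nothing was "cleaned" out of it.

```python
import html

comment = 'If you want to create a horizontal line in HTML, you write <hr>'

# Encode for the HTML context: the representation changes, the meaning does not.
encoded = html.escape(comment)
print(encoded)

# Decoding recovers exactly the original text, so no information was removed.
decoded = html.unescape(encoded)
print(decoded == comment)
```

A browser rendering `encoded` would show the reader the literal characters `<hr>` instead of drawing a horizontal line.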

Again, wrong.

Encoding without proper context means "convert into a coded form". Hmm, that's not exactly what we want. So, let's add the "computing context": now we have, as an example, the ability to encode a WAVE file into an MP3. But wait, we lost information here! Bummer...

Sanitization in the context of computing does not specifically mean that you have to "encode", or better, "transcode". It means that you have to take appropriate measures so that your input DATA cannot be interpreted as CODE by the receiver. Bonus points if the measure you choose is lossless in terms of the information carried by your data.
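One such measure, sketched here under the assumption of Python's built-in `sqlite3` module and a made-up `students` table, is a parameterized query. The value travels to the database as data and is never parsed as SQL, and it is stored without losing a single character.

```python
import sqlite3

# In-memory database and a hypothetical table, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (name TEXT)")

name = "Robert'); DROP TABLE students;--"

# The ? placeholder binds the value as data; the driver never treats it as SQL.
conn.execute("INSERT INTO students (name) VALUES (?)", (name,))

# The table still exists and the name comes back exactly as it went in.
stored = conn.execute("SELECT name FROM students").fetchone()[0]
print(stored == name)
```

Note that nothing was removed from the name, quotes included.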

Well, yeah, "transcode" might be better, but then again there isn't really any hard difference between "encode" and "transcode", or possibly "encode" is just useless because it can not ever happen without an associated decoding of the information source?

But no, in a way, you are getting it all backwards, or at least a bit confusing.

This is how you should construct a system that processes user input:

First, the input format should be defined such that it can only describe things that make sense within the given context, in particular it should usually not be possible to represent in it instructions for programming language interpreters.

Second, whenever you have to represent user input in some context, you have to encode (well, transcode) it into the format of that context. This transcoding generally should only change representation and not change the meaning of the converted information.

This automatically implies that you can not "inject code". There isn't really anything magic about "code". That's what I think is a large part of the confusion around "sanitizing input". The input can not represent code, the conversion does not change the meaning, so if the input can not represent code, the transcoding obviously can not cause code to appear either, and thus you are safe - and not only are you safe, but your system also works as it should otherwise, which it potentially does not if you start "removing dangerous characters".

That is why you should not "sanitize", but only validate and encode/transcode/convert. Which you need to do anyway for your system to work properly. Lack of injection vulnerabilities will result automatically.
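Roughly, in Python, and only as a sketch: `validate_display_name`, `to_html` and `to_json` are hypothetical helpers I made up for this comment, and the 1 to 50 character rule is an arbitrary assumption about what a "display name" is. Validation accepts or rejects the input but never rewrites it, and each output context gets its own encoding step.

```python
import html
import json


def validate_display_name(raw: str) -> str:
    """Accept 1-50 printable characters; reject anything else, never rewrite."""
    if not (1 <= len(raw) <= 50) or not raw.isprintable():
        raise ValueError("display name must be 1-50 printable characters")
    return raw


def to_html(text: str) -> str:
    """Encode plain text for use inside an HTML paragraph."""
    return "<p>" + html.escape(text) + "</p>"


def to_json(text: str) -> str:
    """Encode plain text as a JSON string value."""
    return json.dumps({"name": text})


name = validate_display_name("O'Brien <admin>")
print(to_html(name))
print(to_json(name))
```

The stored value keeps its apostrophe and angle brackets. Only the representation differs per context, and decoding either output gives back the same plain text.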

If I am an inexperienced programmer building their first website and I refuse to even google what sanitizing inputs is, the last thing I need is some headline telling me "NEVER sanitize [my] inputs." Presumably, I won't read that either.

English is not my primary language and I don't know the exact etymology of the word 'sanitize', but to me it sounds more like you have to make the input 'sane' or acceptable. It doesn't imply removing anything, but rather escaping problematic characters, in this case the quotes.

Which is exactly where the confusion is. The input is perfectly sane; it just isn't SQL or HTML, but perfectly sane plain text, which can be converted into perfectly sane HTML or perfectly sane SQL. None of those is in any way "more sane"; each is just the right format for a given use. If you were to put the plain text into a plain text email body, for example, you would not have to do any conversion at all.