by zAy0LfpBZLC8mAC 4456 days ago
Well, yeah, "transcode" might be the better word, but then again there isn't really any hard difference between "encode" and "transcode". Or arguably "encode" on its own is the useless term, because encoding can never happen without an associated decoding of the information from whatever form it was in before.

But no, in a way, you are getting it all backwards, or at least making it a bit confusing.

This is how you should construct a system that processes user input:

First, the input format should be defined such that it can only describe things that make sense within the given context. In particular, it should usually not be possible to represent instructions for programming language interpreters in it.
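As a rough illustration of that first step, here is a minimal Python sketch. The format itself (a display name of 1 to 64 characters with no control characters), the name `NAME_RE`, and the function `parse_display_name` are all made up for the example. The point is only that the value is defined as plain text, so a `<` or a `'` in it is just a character and not markup or SQL.

```python
import re

# Hypothetical format definition: 1-64 characters, no control characters.
# Everything that matches is, by definition, just text.
NAME_RE = re.compile(r"[^\x00-\x1f\x7f]{1,64}")

def parse_display_name(raw: str) -> str:
    """Accept raw only if it is a member of the defined format.

    The value is never altered: it is either valid and returned as-is,
    or rejected with an error.
    """
    if NAME_RE.fullmatch(raw) is None:
        raise ValueError("not a valid display name")
    return raw
```

Note that validation here either accepts or rejects. It never strips or rewrites anything.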

Second, whenever you have to represent user input in some context, you have to encode (well, transcode) it into the format of that context. This transcoding generally should only change representation and not change the meaning of the converted information.
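A sketch of what that second step can look like in Python, using only the standard library. The function names `render_greeting` and `store_and_load` and the `users` table are invented for the example. `html.escape` converts text into the HTML context, and the `?` placeholder leaves the conversion into the SQL context to the `sqlite3` driver.

```python
import html
import sqlite3

def render_greeting(name: str) -> str:
    # Text -> HTML: same meaning, different representation.
    return "<p>Hello, " + html.escape(name, quote=True) + "</p>"

def store_and_load(name: str) -> str:
    # Text -> SQL value: the placeholder lets the driver handle the
    # representation, so the value is never parsed as part of the statement.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (name TEXT)")
    conn.execute("INSERT INTO users (name) VALUES (?)", (name,))
    (loaded,) = conn.execute("SELECT name FROM users").fetchone()
    conn.close()
    return loaded
```

In both cases the original string can be recovered unchanged from the target context, which is what "only change representation" means in practice.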

This automatically implies that you cannot "inject code". There isn't really anything magic about "code", and I think missing that is a large part of the confusion around "sanitizing input". If the input format cannot represent code, and the conversion does not change the meaning, then the transcoding obviously cannot cause code to appear either, and thus you are safe. And not only are you safe, but your system also works as it otherwise should, which it potentially does not if you start "removing dangerous characters".
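To show the difference, here is a small Python comparison. The `sanitize` function is a made-up stand-in for the "removing dangerous characters" approach, and `encode_for_html` is a thin wrapper around the standard library's `html.escape`. Both names are assumptions for the example.

```python
import html

def sanitize(value: str) -> str:
    # The approach argued against: delete characters that look dangerous.
    # This changes the meaning of perfectly legitimate text.
    return "".join(ch for ch in value if ch not in "<>&'\"")

def encode_for_html(value: str) -> str:
    # The approach argued for: change the representation, keep the meaning.
    return html.escape(value, quote=True)

text = "if a < b && b > c, say 'ok'"

damaged = sanitize(text)          # information is lost for good
encoded = encode_for_html(text)   # safe to embed in HTML
restored = html.unescape(encoded) # a browser would display exactly `text`
```

The sanitized version can never be turned back into what the user wrote, while the encoded version decodes to exactly the original string.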

That is why you should not "sanitize", but only validate and encode/transcode/convert, which you need to do anyway for your system to work properly. The absence of injection vulnerabilities then follows automatically.