| You seem to be assuming a lot about a development environment that was never specified. This is about writing software that handles input from an external, potentially hostile source. Parsing URLs that were supplied by the user is one example of that. > Your code has to do something when it gets a URI Yes, that's exactly my point. You need to define what your code will do with any URL - actually, any input, including input that is malformed or malicious - which includes both known and all possible future schemes. For this specific example, the correct thing to do is recognize that e.g. your software only handles http{,s} URLs, so every other scheme should not be included in the recognized grammar. Any input outside that is invalid and dropped while dispatching any necessary error handling. > third party code ...is off topic. This is about handling input to any code you write. Any 3rd parties also need to define what they accept as input. > it will be prohibitively difficult for a new URI scheme (or what have you) to gain traction. That is a separate problem that will always exist. You're trying to prematurely optimize in an insecure way. Worrying about potential future problems doesn't justify writing bad code today that passes hostile data without verification. If you know that a URL scheme - or collection of schemes - will be handled properly, then define it as valid and pass it along. If it isn't handled or you don't know if it will be handled properly, define it as invalid and drop it. Doing otherwise is choosing to add a security hole. The same goes for every other byte of data received from a hostile source. |
The position you've staked out is "stop trying to enumerate badness." All I need is one good counterexample.
For example, Google Safe Browsing maintains a blacklist of malicious domains that clients can check. Are you suggesting that they should whitelist domains instead? What about subdomains? IP addresses?
How about email addresses for spam filtering?
You often don't have good (or any) information about whether a given instance of a thing is malicious or not. Blocking all such things also blocks the innocent things. In some contexts that's a cost you have to pay, but as a general rule it's not something you want.
> Yes, that's exactly my point. You need to define what your code will do with any URL - actually, any input, including input that is malformed or malicious - which includes both known and all possible future schemes.
You have to define what your code will do, but what it should do is the original question.
> For this specific example, the correct thing to do is recognize that e.g. your software only handles http{,s} URLs, so every other scheme should not be included in the recognized grammar.
That's just assuming the conclusion. You could also use a grammar that accepts any RFC 3986-compliant URI that has a handler available for its scheme, and make the handler responsible for dealing with malicious input.
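To make that alternative concrete, here's a rough sketch in Python. Everything in it (`HANDLERS`, `register`, `dispatch`, `handle_http`) is a made-up name for illustration, and it only checks the scheme syntax from RFC 3986 section 3.1 rather than the full URI grammar, so treat it as an outline of the idea and not a complete validator.

```python
import re
from urllib.parse import urlsplit

# RFC 3986 section 3.1: scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
SCHEME_RE = re.compile(r"[A-Za-z][A-Za-z0-9+.-]*")

# Hypothetical registry mapping a lowercase scheme to its handler.
HANDLERS = {}


def register(scheme, handler):
    """Make a handler available for a scheme (schemes are case-insensitive)."""
    HANDLERS[scheme.lower()] = handler


def dispatch(uri):
    """Route a URI to the handler for its scheme.

    The dispatcher only checks that a syntactically valid scheme is present
    and that a handler exists. Validating the rest of the URI is the
    handler's job.
    """
    scheme, sep, _rest = uri.partition(":")
    if not sep or not SCHEME_RE.fullmatch(scheme):
        raise ValueError("not a URI with a valid scheme")
    handler = HANDLERS.get(scheme.lower())
    if handler is None:
        raise LookupError(f"no handler available for scheme {scheme!r}")
    return handler(uri)


def handle_http(uri):
    """Example handler: it does its own validation of what it was handed."""
    parts = urlsplit(uri)
    if not parts.hostname:
        raise ValueError("http(s) URI needs a host")
    return ("fetch", parts.hostname)


register("http", handle_http)
register("https", handle_http)
```

Supporting a new scheme here means registering a handler for it; the dispatcher's grammar doesn't change, and a URI with no handler still gets rejected instead of passed along blindly.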
> ...is off topic. This is about handling input to any code you write.
It's about where to handle and validate input. Most data is going to be passed through multiple independent applications on separate machines, through networks with multiple middleboxes, etc.
A general premise that you should block anything you don't recognize is flawed. It requires every component to either understand everything about everything it handles, or discard it. An FTP client with a whitelist of files you can transfer is doing it wrong.