by jerf 5751 days ago
That's still the wrong approach (if it's the only part of the solution), and I wouldn't be surprised if there's still a problem in there somewhere. That's entirely the wrong place to deal with this. The correct solution is the moral equivalent of "<a href='" + html_escape(url) + "'>", where "html_escape" converts the URL into a properly encoded HTML string regardless of contents, and for simplicity I'm assuming some other cleansing process has run on the url elsewhere (to ensure http: or https: is the only legal beginning, etc). (This is the way you ensure you don't get XSS in your link. Other security properties that you may desire, such as controlling what the user can link to, get enforced elsewhere.)

Then it simply doesn't matter what the user has managed to get down to the link generation code; the html_escape code should at least ensure that the user is stuck in the link itself. There are some paranoia things such a function should still do, such as removing all characters that are not legal in links and removing all invalid characters (incorrect UTF-8, for instance); consult the relevant standards for a full description. But this is still way easier, and therefore more likely to correctly avoid XSS, than trying to pick up all possible badness at the parse step.
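A minimal sketch of that link generation step, assuming Python and its standard html.escape as the "html_escape" function (the name link_open_tag is made up for illustration, and the scheme check is assumed to have happened elsewhere):

```python
import html


def link_open_tag(url: str) -> str:
    """Build the opening <a> tag for a URL that was already scheme-checked elsewhere.

    html.escape with quote=True escapes &, <, > and both quote characters,
    so the value cannot close the single-quoted href attribute.
    """
    return "<a href='" + html.escape(url, quote=True) + "'>"
```

Whatever reaches this function, a quote or angle bracket in the input comes out as an entity, so the user stays stuck inside the attribute value.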

It continues to astonish me how hard people make this, how much developers resist being told that their code is problematic, and how surprised they are when their site gets taken down by the stupidest errors...

Also, if at all possible, I strongly endorse environments where you don't literally type "<a href='" + html_escape(url) + "'>", because you will forget the html_escape. There are a variety of ways to reach this goal, depending on language.
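One possible shape for such an environment, sketched in Python: a tiny tag builder that escapes every attribute value itself, so the call site has nothing to forget. The helper name open_tag is hypothetical, and it assumes tag and attribute names come from the programmer, never from user input.

```python
import html


def open_tag(name: str, **attrs: str) -> str:
    """Render an opening tag, escaping every attribute value unconditionally.

    Assumption: `name` and the attribute keys are written by the programmer;
    only the attribute values may contain user-controlled data.
    """
    parts = [name]
    for key, value in attrs.items():
        parts.append(f"{key}='{html.escape(value, quote=True)}'")
    return "<" + " ".join(parts) + ">"
```

A call such as open_tag("a", href=url) then produces the escaped link without html_escape ever appearing at the call site. Template engines with autoescaping reach the same goal by a different route.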

2 comments

I don't understand what they are doing? I don't recall @ having special significance in a URL?

I can only guess that they have two separate steps for transforming URLs into links and transforming @replies into links. Then they first run the URL transformer and then the @replies transformer, which would of course mess up the URL.

I have solved that problem in one of my Twitter apps (transforming both in one go), maybe I should send them a code snippet...
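This is not the snippet mentioned above, just a rough sketch of what "in one go" could look like in Python: one regular expression with two alternatives, so an @ inside a URL is consumed by the URL match and never seen by the @reply branch. The patterns and the twitter.com profile URL are simplifying assumptions, and no scheme check beyond http/https is attempted.

```python
import html
import re

# URLs are tried first; an @name is only matched when it is not part of a URL
# or of a longer word (e.g. an email address).
TOKEN = re.compile(r"(?P<url>https?://\S+)|(?<!\w)@(?P<user>\w+)")


def linkify(text: str) -> str:
    """Turn URLs and @replies into links in a single pass, escaping everything."""
    out = []
    pos = 0
    for m in TOKEN.finditer(text):
        # Plain text between matches is escaped too.
        out.append(html.escape(text[pos:m.start()]))
        if m.group("url") is not None:
            url = html.escape(m.group("url"), quote=True)
            out.append(f"<a href='{url}'>{url}</a>")
        else:
            user = html.escape(m.group("user"), quote=True)
            out.append(f"<a href='https://twitter.com/{user}'>@{user}</a>")
        pos = m.end()
    out.append(html.escape(text[pos:]))
    return "".join(out)
```

Because each character of the input belongs to at most one match, the second transformer never gets a chance to rewrite the output of the first.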

They are trying to match URLs so that they can turn them into links. The @ character is valid in a URL. What I don't understand is why they don't URL encode the matching text.
Actually, you want to perform URL encoding in an example like this, as "javascript:alert(1)" will slip past un-encoded when you only escape HTML entities.

See http://www.owasp.org/index.php/XSS_%28Cross_Site_Scripting%2... for more information.

As Rule #5 of your own link states: "WARNING: Do not encode complete or relative URL's with URL encoding! URL's should be encoded based on the context of display like any other piece of data. For example, user driven URL's in HREF links should be attribute encoded."

URL encoding is for querystring parameters. The HTML escaping is for the inside of attributes. You need to do both, in the proper place; I assumed you already had a URL with the proper escaping at the time that I was discussing, again, for simplicity, because the full story doesn't really fit in an HN comment: http://www.jerf.org/iri/post/2548
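To make "both, in the proper place" concrete, here is a small Python sketch: the querystring values get URL encoding when the URL is assembled, and the finished URL gets HTML escaping when it is placed in the attribute. The function name search_link and the parameter names q and lang are invented for the example.

```python
import html
from urllib.parse import urlencode


def search_link(base: str, query: str) -> str:
    """Build an <a> tag for a search URL containing a user-supplied query."""
    # Step 1: URL-encode the querystring parameters (user data goes here).
    url = base + "?" + urlencode({"q": query, "lang": "en"})
    # Step 2: HTML-escape the whole URL for use inside the attribute.
    return "<a href='" + html.escape(url, quote=True) + "'>"
```

Note that the & separating the two parameters is itself turned into &amp;amp; by the second step, which is exactly the part people tend to skip.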

That's also why I mention you need a separate phase specially for URLs, where you will for instance immediately reject any URL that does not start with one of your whitelisted protocols, which "javascript:" won't be on. "javascript:" is far from the only protocol that can get you in trouble, it's just the most obvious.
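A rough sketch of that separate URL phase in Python, using the standard urlsplit to read the scheme. The allowlist contents and the name is_allowed_url are assumptions, and this only checks the scheme; it is not a full URL validator.

```python
from urllib.parse import urlsplit

# Assumed allowlist: anything not listed here is rejected outright.
ALLOWED_SCHEMES = {"http", "https"}


def is_allowed_url(url: str) -> bool:
    """Return True only if the URL starts with an allowlisted scheme."""
    try:
        scheme = urlsplit(url.strip()).scheme
    except ValueError:
        # urlsplit raises ValueError on some malformed inputs; reject those.
        return False
    return scheme.lower() in ALLOWED_SCHEMES
```

The point of an allowlist over a blocklist is that "javascript:", "data:" and whatever gets invented next are all rejected without having to be named.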