by ww520 703 days ago
Still looking for a way to safely parse an HTML string into a DOM while avoiding XSS attacks. Most solutions end up sanitizing the input.
4 comments

Our approach at work: parse it as HTML, define a short list of known-acceptable tags & attributes, and strip everything else.

Limiting attributes to ["href", "src"] and tags to ["p", "br", "h1", "ul", "ol", "li", "span", "div", "img"] gets you remarkably close to rendering the safe bits of HTML - add to that list upon request.
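A minimal sketch of that allowlist approach, using only Python's standard-library `html.parser`. The tag and attribute lists come from the comment above; the class name `AllowlistSanitizer`, the helper `sanitize`, and the choice to drop the contents of `<script>` and `<style>` are assumptions for illustration. It does not validate the values of `href` or `src`, which a real sanitizer would also need to do, and `html.parser` is not a full HTML5 parser, so treat this as an illustration of the idea rather than a production filter.

```python
from html import escape
from html.parser import HTMLParser

# Allowlists taken from the comment above; everything else is dropped.
ALLOWED_TAGS = {"p", "br", "h1", "ul", "ol", "li", "span", "div", "img"}
ALLOWED_ATTRS = {"href", "src"}
VOID_TAGS = {"br", "img"}            # elements that never get a closing tag
DROP_CONTENT = {"script", "style"}   # their text must not leak into the output


class AllowlistSanitizer(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in DROP_CONTENT:
            self.skip_depth += 1
            return
        if self.skip_depth or tag not in ALLOWED_TAGS:
            return
        # Keep only allowlisted attributes and re-escape their values.
        kept = "".join(
            f' {name}="{escape(value or "", quote=True)}"'
            for name, value in attrs
            if name in ALLOWED_ATTRS
        )
        self.out.append(f"<{tag}{kept}>")

    def handle_endtag(self, tag):
        if tag in DROP_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
            return
        if self.skip_depth or tag not in ALLOWED_TAGS or tag in VOID_TAGS:
            return
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        # Text is re-escaped so it can never be read back as markup.
        if not self.skip_depth:
            self.out.append(escape(data, quote=False))


def sanitize(html_text):
    parser = AllowlistSanitizer()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)
```

For example, `sanitize('<img src="a.png" onerror="alert(1)">')` keeps the tag and its `src` but drops the `onerror` handler.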

If you want to take it further, use an `iframe srcdoc=""` with sandbox attributes set.
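A small sketch of how the `srcdoc` variant could be generated server-side in Python. The function name `sandboxed_iframe` is hypothetical. The key assumption is that the untrusted markup is attribute-escaped before being placed in `srcdoc`, so it cannot break out of the attribute; an empty `sandbox` attribute applies every sandbox restriction, including disabling scripts.

```python
from html import escape


def sandboxed_iframe(untrusted_html):
    # quote=True escapes both quote characters, so the markup cannot close
    # the srcdoc="..." attribute early. The browser decodes the entities
    # again when it loads the frame's document.
    return f'<iframe sandbox="" srcdoc="{escape(untrusted_html, quote=True)}"></iframe>'
```

Restrictions can be relaxed selectively by listing tokens in `sandbox`, but combining `allow-scripts` with `allow-same-origin` largely defeats the sandbox for same-origin content.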

> Limiting attributes to ["href", "src"]

You need to clean that up as well to avoid e.g. javascript: links, and then there are more issues with SVG if you allow media uploads.
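One way the `href`/`src` cleanup might look, as a hedged Python sketch. The function name `is_safe_url` and the scheme allowlist are assumptions, not anything from the thread. It deliberately strips whitespace and control characters before looking for a scheme, because browsers ignore tabs and newlines inside a URL, which is a common way to sneak `javascript:` past naive string checks. It does not address the SVG issue mentioned above.

```python
import re

# Assumed policy for illustration; adjust to what your application needs.
SAFE_SCHEMES = {"http", "https", "mailto"}
_SCHEME = re.compile(r"^([a-zA-Z][a-zA-Z0-9+.\-]*):")


def is_safe_url(value):
    # Drop spaces and control characters first. This is stricter than what
    # browsers do, so it can only reject more, never accept more.
    cleaned = "".join(ch for ch in value if ord(ch) > 0x20)
    match = _SCHEME.match(cleaned)
    if match is None:
        return True  # no scheme: a relative URL such as "/about"
    return match.group(1).lower() in SAFE_SCHEMES
```

The check is an allowlist of schemes rather than a blocklist of `javascript:`, so `data:` and `vbscript:` URLs are rejected without being named.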

Then you need to be very sure you’re using a proper HTML5 parser and that your rendering is completely canonicalized, or you open yourself up to filter evasions (https://cheatsheetseries.owasp.org/cheatsheets/XSS_Filter_Ev...)

And of course I assume that’s what you meant, but you should not add to the list just because someone requests it; you should evaluate each addition first.

Yes - just double-checked those; thankfully the framework builtins are correct (staying up to date with a well-maintained framework does wonders for your security posture).

Wasn't there a case of a security issue that came from abusing differences between parsers in different places (server, client, or different browsers)?

If you want to avoid XSS attacks, have you tried a CSP header? I know it is more of an output validation, as you restrict what can happen with external scripts.

You can only fit so many characters in your exploit, often due to max field lengths, unless you can load some external script. Disabling loading unknown external scripts with CSP significantly reduces possible attacks, including XSS attacks, because you simply don't have the space.
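A sketch of what such a header could look like, built in Python. The helper name `build_csp` and the specific directive choices are assumptions for illustration; a real policy depends on which origins the site actually loads scripts from.

```python
def build_csp(script_sources=("'self'",)):
    # Assumed baseline policy: only same-origin resources, no plugins,
    # and no <base> tag tricks to redirect relative script URLs.
    directives = {
        "default-src": ("'self'",),
        "script-src": tuple(script_sources),
        "object-src": ("'none'",),
        "base-uri": ("'none'",),
    }
    return "; ".join(
        f"{name} {' '.join(values)}" for name, values in directives.items()
    )


# The value would be sent as a response header by whatever server you use.
headers = {"Content-Security-Policy": build_csp()}
```

Because `script-src` here does not include `'unsafe-inline'`, the browser also refuses inline `<script>` blocks and inline event handlers, not just unknown external scripts.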

There's another one that works 100% of the time.

Do client-side rendering. Send static HTML, then query the backend for the content and insert it as text. Something like p.textContent = ... It's safe.

It's pretty much the same as what a prepared statement does in SQL: send data and code through different channels.

Huh? This is a 100% solved problem in most languages. You just need to replace all of HTML's special characters with their escaped / encoded form. Eg, '<' becomes '&lt;', and so on for &, ", ', >, and all the rest.

There are libraries in almost every language to do this for you. A quick google search found these:

JS: https://github.com/parshap/html-escape

PHP: https://www.php.net/manual/en/function.htmlentities.php

And there are many more.
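As one more instance of the same idea, Python's standard library ships `html.escape`. A short sketch, where the sample input string is made up for illustration:

```python
from html import escape

user_input = "<img src=x onerror=\"alert('xss')\">"

# quote=True also escapes " and ', which matters inside attribute values.
safe = escape(user_input, quote=True)

print(f"<p>{safe}</p>")
# prints: <p>&lt;img src=x onerror=&quot;alert(&#x27;xss&#x27;)&quot;&gt;</p>
```

The browser renders the escaped string as literal text, so none of it is treated as markup.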

You are right in that it is solved if the goal is "I don't want any part of the string to be treated as HTML"

It's trickier if the goal is "I want to allow <strong> and <em> tags in the string to be rendered as bold and italic, but I don't want scripts to execute". It is possible, with things like DOMPurify, but ideally you'd try to avoid this if at all possible.

> It's trickier if the goal is "I want to allow <strong> and <em> tags in the string to be rendered as bold and italic, but I don't want scripts to execute"

Yes, because you're no longer allowing HTML, but something similar to HTML that isn't quite HTML (and which subset that is differs between people, projects, etc.).

So I personally would move up the requirements chain, where the requirement to allow "html" should be scrapped, and instead changed to be something like Markdown - a pre-existing formatting protocol that does not have the undesirable aspects.

Or, as an alternative, host the HTML (without the stripping of "undesirables") in a separate iframe, on a totally different domain, and rely on the browser's cross-origin protection to prevent undesirable scripts or data leaks.

"So i personally would move up the requirements chain, where the requirement to allow "html" should be scrapped, and instead changed to be something like markdown - a pre-existing formatting protocol that does not have the undesirable aspects."

This would be how I would choose to solve this, if the option was available.

But sometimes people do want some HTML compatibility for legit reasons.

If you want a data format which expresses some specific subset of HTML, well, do that then. Again, validate on output that the text you're showing is within the defined subset and escape everything else. Eg, "<strong> is passed verbatim to the browser but any other < character is replaced with &lt;".
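The "escape everything, then pass a defined subset through verbatim" idea can be sketched like this in Python. The function name `render_subset` and the exact subset (bare `<strong>` and `<em>` tags with no attributes) are assumptions for illustration.

```python
from html import escape

# Hypothetical subset: only these exact tags, with no attributes, pass through.
PASSTHROUGH_TAGS = ("strong", "em")


def render_subset(text):
    # Step 1: escape everything, so nothing in the input is live markup.
    out = escape(text, quote=True)
    # Step 2: restore only the exact escaped forms of the allowed tags.
    for tag in PASSTHROUGH_TAGS:
        out = out.replace(f"&lt;{tag}&gt;", f"<{tag}>")
        out = out.replace(f"&lt;/{tag}&gt;", f"</{tag}>")
    return out
```

A tag with any attribute, such as `<strong onclick="x">`, does not match the exact form and stays escaped. This sketch does not check that tags are balanced, so an unclosed `<strong>` can still affect the styling of the rest of the page even though it cannot run scripts.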

This technique still works fine. You just need to also do the work of defining what your data format looks like, and how it should be parsed and displayed in a web browser.

> You just need to also do the work ...

and thus, make mistakes and allow XSS.

I don't think this is sufficient. Scripts could still do bad things, for example mining crypto.

Markdown sources can contain HTML, which most parsers will gladly spit back out unescaped unless it's wrapped in a code block.

I would much rather trust a sanitizer library written by someone who knows about security, than trust a Markdown parser that was never intended for that kind of role. I've built apps that ingest Markdown, and I always pipe the parser's output to a proper sanitizer.

Using an iframe is a clever workaround, but good luck convincing Google et al. to treat the contents of that iframe as part of the page you want indexed.