Our approach at work: parse it as HTML, define a short list of known-acceptable tags & attributes, and strip everything else.
Limiting attributes to ["href", "src"] and tags to ["p", "br", "h1", "ul", "ol", "li", "span", "div", "img"] gets you remarkably close to rendering the safe bits of HTML - add to that list upon request.
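For anyone who wants to see the shape of that allowlist approach, here is a rough sketch using only Python's standard-library `html.parser`. The tag and attribute lists are the ones from the comment above; everything else (the class name, the `is_safe_url` helper, restricting URLs to http/https/relative, dropping the contents of `script`/`style`) is my own assumption and not what any particular framework does. Treat it as an illustration, not a vetted sanitizer.

```python
import re
from html import escape
from html.parser import HTMLParser

ALLOWED_TAGS = {"p", "br", "h1", "ul", "ol", "li", "span", "div", "img"}
ALLOWED_ATTRS = {"href", "src"}
VOID_TAGS = {"br", "img"}          # elements that never get a closing tag
DROP_CONTENT = {"script", "style"}  # assumption: drop these tags AND their contents


def is_safe_url(value):
    """Hypothetical helper: allow relative URLs and http(s) only."""
    # Browsers ignore whitespace/control characters around and inside the
    # scheme, so remove them before looking for one.
    compact = "".join(ch for ch in value if ord(ch) > 0x20)
    head = re.split(r"[/?#]", compact, maxsplit=1)[0]
    if ":" not in head:
        return True  # no scheme, so a relative or protocol-relative URL
    return head.split(":", 1)[0].lower() in {"http", "https"}


class AllowlistSanitizer(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # > 0 while inside a script/style element

    def handle_starttag(self, tag, attrs):
        if tag in DROP_CONTENT:
            self.skip_depth += 1
            return
        if self.skip_depth or tag not in ALLOWED_TAGS:
            return  # unknown tag: drop the tag itself, keep its text
        kept = []
        for name, value in attrs:
            if name in ALLOWED_ATTRS and value is not None and is_safe_url(value):
                kept.append(f' {name}="{escape(value, quote=True)}"')
        self.out.append(f"<{tag}{''.join(kept)}>")

    def handle_endtag(self, tag):
        if tag in DROP_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
            return
        if self.skip_depth or tag not in ALLOWED_TAGS or tag in VOID_TAGS:
            return
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            # Text is always re-escaped on the way out.
            self.out.append(escape(data, quote=False))


def sanitize(html_text):
    parser = AllowlistSanitizer()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)


dirty = ('<p onclick="x()">Hi <script>alert(1)</script>'
         '<a href="javascript:alert(1)">x</a>'
         '<img src="/a.png" onerror="y()"></p>')
print(sanitize(dirty))  # <p>Hi x<img src="/a.png"></p>
```

Note that comments, doctypes and processing instructions vanish because the parser's default handlers ignore them, and that this sketch does not try to balance or re-nest tags.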
If you want to take it further, use an `iframe srcdoc=""` with sandbox attributes set.
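A minimal sketch of the `srcdoc` idea, in Python for lack of a better common language. The function name is made up; the one real detail is that the untrusted markup has to be attribute-escaped before it goes into `srcdoc`, and that an empty `sandbox` attribute applies every restriction (no scripts, no forms, opaque origin).

```python
from html import escape


def sandboxed_iframe(untrusted_html):
    # Hypothetical helper. quote=True matters here: the markup lives inside
    # a double-quoted attribute, so '"' must become &quot;.
    body = escape(untrusted_html, quote=True)
    # sandbox="" with no tokens enables all sandbox restrictions.
    return f'<iframe sandbox="" srcdoc="{body}"></iframe>'


print(sandboxed_iframe('<b title="x">hi</b>'))
# <iframe sandbox="" srcdoc="&lt;b title=&quot;x&quot;&gt;hi&lt;/b&gt;"></iframe>
```

You would loosen the sandbox by adding tokens (for example `allow-popups`) only if the content really needs them.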
Yes - just double checked those, thankfully the framework builtins are correct (staying up to date with a well maintained framework does wonders for your security posture).
If you want to avoid XSS attacks, have you tried a CSP header? I know it is more of an output validation, as you restrict what can happen with external scripts.
You can only fit so many characters in your exploit, often due to max field lengths, unless you can load some external script.
Disabling loading unknown external scripts with CSP significantly reduces possible attacks, including XSS attacks, because you simply don't have the space.
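To make that concrete, here is a small sketch that builds one possible policy as a header tuple. The specific directives are my assumption of a reasonable starting point, not a recommendation for any particular site; `csp_header` is a made-up name.

```python
def csp_header(script_sources=("'self'",)):
    """Build a (name, value) pair for a restrictive Content-Security-Policy."""
    directives = {
        "default-src": ["'self'"],
        "script-src": list(script_sources),  # only these origins may serve scripts
        "object-src": ["'none'"],
        "base-uri": ["'none'"],
    }
    value = "; ".join(
        f"{name} {' '.join(sources)}" for name, sources in directives.items()
    )
    return ("Content-Security-Policy", value)


print(csp_header())
# ('Content-Security-Policy',
#  "default-src 'self'; script-src 'self'; object-src 'none'; base-uri 'none'")
```

Because `script-src` here does not include `'unsafe-inline'`, the browser also refuses inline `<script>` blocks and inline event handlers, so the policy helps even when the injected payload fits in the field.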
Huh? This is a 100% solved problem in most languages. You just need to replace all of HTML's special characters with their escaped / encoded form. E.g., '<' becomes '&lt;', and so on for &, ", ', >, and all the rest.
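In Python, for instance, the standard library's `html.escape` does exactly this. The wrapper function below is just a made-up example of escaping at output time:

```python
from html import escape


def render_comment(user_text):
    # Hypothetical template helper: escape on output, right where the
    # text is placed into HTML. quote=True also escapes " and '.
    return f"<p>{escape(user_text, quote=True)}</p>"


print(render_comment('<script>alert("x")</script>'))
# <p>&lt;script&gt;alert(&quot;x&quot;)&lt;/script&gt;</p>
```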
There are libraries in almost every language to do this for you. A quick google search found these:
You are right in that it is solved if the goal is "I don't want any part of the string to be treated as HTML"
It's trickier if the goal is "I want to allow <strong> and <em> tags in the string to be rendered as bold and italic, but I don't want scripts to execute". It is possible, with things like DOMPurify, but ideally you'd try to avoid this if at all possible.
> It's trickier if the goal is "I want to allow <strong> and <em> tags in the string to be rendered as bold and italic, but I don't want scripts to execute"
Yes, because you're no longer allowing HTML; you're allowing something similar to HTML that isn't quite HTML (and the exact subset differs between people, projects, etc.).
So I personally would move up the requirements chain, where the requirement to allow "html" should be scrapped, and instead changed to be something like markdown - a pre-existing formatting protocol that does not have the undesirable aspects.
Or, as an alternative, host the html (without the stripping of "undesirables") in a separate iframe, on a totally different domain, and rely on the browser's cross-origin protection to prevent undesirable scripts or data leaks.
> So I personally would move up the requirements chain, where the requirement to allow "html" should be scrapped, and instead changed to be something like markdown - a pre-existing formatting protocol that does not have the undesirable aspects.
This would be how I would choose to solve this, if the option was available.
But sometimes people do want some HTML compatibility for legit reasons.
If you want a data format which expresses some specific subset of HTML, well, do that then. Again, validate on output that the text you're showing is within the defined subset and escape everything else. E.g., "<strong> is passed verbatim to the browser but any other < character is replaced with &lt;".
This technique still works fine. You just need to also do the work of defining what your data format looks like, and how it should be parsed and displayed in a web browser.
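A rough sketch of what that could look like, assuming the "data format" is plain text plus bare `<strong>` and `<em>` tags with no attributes. The function name and the two-tag list are illustrative choices. The trick is to escape everything first and then un-escape only the exact tag spellings you allow:

```python
import re
from html import escape

ALLOWED = ("strong", "em")  # assumption: the whole permitted subset
# Matches an already-escaped, attribute-free opening or closing tag.
_ESCAPED_TAG = re.compile(r"&lt;(/?)(%s)&gt;" % "|".join(ALLOWED))


def render_subset(text):
    escaped = escape(text, quote=True)          # nothing is markup any more
    return _ESCAPED_TAG.sub(r"<\1\2>", escaped)  # restore only exact matches


print(render_subset('<strong>hi</strong> <script>x</script>'))
# <strong>hi</strong> &lt;script&gt;x&lt;/script&gt;
```

Anything that isn't a literal `<strong>`, `</strong>`, `<em>` or `</em>` stays escaped, including `<strong onclick="...">`. This sketch does not check that the tags are balanced, which a real format definition would need to address.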
Markdown sources can contain HTML, which most parsers will gladly spit back out unescaped unless it's wrapped in a code block.
I would much rather trust a sanitizer library written by someone who knows about security, than trust a Markdown parser that was never intended for that kind of role. I've built apps that ingest Markdown, and I always pipe the parser's output to a proper sanitizer.
Using an iframe is a clever workaround, but good luck convincing Google et al. to treat the contents of that iframe as part of the page you want indexed.