by buro9 4596 days ago
One thing I'd like to see in Go is a way to sanitise HTML based on a whitelist.

This is to accompany blackfriday (Markdown) and text/template (templating).

Markdown permits HTML, which leaves scope for nasty stuff to get in, or for bugs in blackfriday to be exploited, producing HTML that could be the source of an XSS attack.

We're currently running our user generated content through this: https://github.com/microcosm-cc/cleanse and more specifically this: https://github.com/microcosm-cc/cleanse/blob/master/src/main...

Which is a set of rules for OWASP's HTML sanitizer: https://code.google.com/p/owasp-java-html-sanitizer/

This works extremely well, except that our Go code has to call out to an external process and start a Java process for each request.

If no-one beats us to it we'll be porting that to Go... but for us it remains a "When we need to" from a scaling or performance perspective.

Still... we'd love to see a group work on a whitelist-based sanitiser that we can contribute to, rather than us going and writing our own at some indeterminate point in the future.

1 comments

In moving a site from PHP to Go in a rewrite, I'm about to head down that same road. The plan is to use the HTML parser from the "not quite in the real distribution but kind of official" net repository:

http://godoc.org/code.google.com/p/go.net/html

Parse the HTML, walk the result, write that which is acceptable.

I have to restrict by tag, URL scheme, and URL server name in various contexts.
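The URL part of that restriction can be sketched with the standard library alone. This is only an illustration, not code from either poster: `URLPolicy`, its fields, and `Allows` are hypothetical names, and the choice to reject relative and scheme-relative URLs is an assumption that may not suit every context.

```go
package main

import (
	"net/url"
	"strings"
)

// URLPolicy is a hypothetical allowlist for link targets.
type URLPolicy struct {
	Schemes map[string]bool // allowed schemes, lower case, e.g. "https"
	Hosts   map[string]bool // allowed host names, lower case; empty means any host
}

// Allows reports whether raw parses as a URL whose scheme (and host,
// if hosts are restricted) is on the allowlist. Relative and
// scheme-relative URLs have an empty scheme and are rejected.
func (p URLPolicy) Allows(raw string) bool {
	u, err := url.Parse(strings.TrimSpace(raw))
	if err != nil {
		return false
	}
	// url.Parse lower-cases the scheme, so "HTTPS" matches "https".
	if !p.Schemes[u.Scheme] {
		return false
	}
	if len(p.Hosts) == 0 {
		return true
	}
	// Hostname strips any port and userinfo; compare case-insensitively.
	return p.Hosts[strings.ToLower(u.Hostname())]
}
```

With a policy allowing only `http` and `https` on `example.com`, `https://example.com/a` would pass, while `javascript:alert(1)`, `//evil.test/x` and `https://example.com@evil.test/` would all be refused. The tag restriction would still need a real HTML parser on top of this.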

I looked at that myself, and decided that it wasn't the path I should go down.

ParseFragment returns an error on bad input, but actually I just want the bad part stripped and processing to carry on. If a user has put in a mostly usable piece of HTML and got something wrong by mistake rather than through bad intent, then permissiveness in how we handle that should rule.

And then I wondered about the wisdom of creating a potentially large security library on a not quite nailed down API.

Ultimately, given that this is a security thing, I figured it's best to go with the proven, many-eyeballed solution that has had widespread acceptance.

Feel free to use the package we've provided. The bit of Go code you need for it is:

    import (
    	"os/exec"
    	"strings"
    )
    
    // SanitiseHTML pipes html through the cleanse jar and returns the
    // sanitised output. It starts one Java process per call.
    func SanitiseHTML(html string) (string, error) {
    	cleanse := exec.Command("java", "-jar", "/usr/sbin/cleanse.jar", "--permissive")
    
    	// Feed the input on stdin. Output starts the process, copies stdin
    	// to it, waits for it to exit and returns its stdout, so no separate
    	// Start call is needed (calling Start after Output returns an error).
    	cleanse.Stdin = strings.NewReader(html)
    
    	buff, err := cleanse.Output()
    	if err != nil {
    		return "", err
    	}
    
    	return string(buff), nil
    }

Thanks to a fellow HN reader, we now have a head start on trying to create an HTML sanitizer.

https://github.com/microcosm-cc/bluemonday

Feel free to use it and help out.

Initial work was all done by Matt Jibson as part of his Google Reader Clone: https://github.com/mjibson/goread