by jws 4596 days ago
I'm rewriting a site from PHP to Go and am about to head down that same road. The plan is to use the HTML parser from the "not quite in the real distribution but kind of official" net repository:

http://godoc.org/code.google.com/p/go.net/html

Parse the HTML, walk the result, write that which is acceptable.

I have to restrict by tag, URL scheme, and URL server name, and the rules differ by context.
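
A minimal sketch of just the URL part of that restriction, using only the standard library's net/url. The `urlPolicy` type and its `allows` method are hypothetical names made up for illustration, and the allowlisted schemes and host are assumptions rather than anything from the real site:

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// urlPolicy is a hypothetical allowlist for one context (say, <a href>).
// It is illustrative only and not part of any library.
type urlPolicy struct {
	schemes map[string]bool // allowed URL schemes, lower case
	hosts   map[string]bool // allowed host names, lower case; empty means any host
}

// allows reports whether raw parses as a URL whose scheme, and host if
// hosts are restricted, are on the allowlist. Relative URLs have no scheme
// and are rejected by this sketch.
func (p urlPolicy) allows(raw string) bool {
	u, err := url.Parse(strings.TrimSpace(raw))
	if err != nil {
		return false
	}
	if !p.schemes[strings.ToLower(u.Scheme)] {
		return false
	}
	if len(p.hosts) == 0 {
		return true
	}
	return p.hosts[strings.ToLower(u.Hostname())]
}

func main() {
	links := urlPolicy{
		schemes: map[string]bool{"http": true, "https": true},
		hosts:   map[string]bool{"example.com": true},
	}
	for _, raw := range []string{
		"https://example.com/page",
		"javascript:alert(1)",
		"https://evil.example.net/",
	} {
		fmt.Println(raw, links.allows(raw))
	}
}
```

A different context would get its own `urlPolicy` value, which is one way to express "in various contexts" without branching inside the check itself.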

2 comments

I looked at that myself, and decided that it wasn't the path I should go down.

ParseFragment returns an error on bad input, but I just want the bad parts stripped and processing to carry on. If a user has entered a mostly usable piece of HTML and got something wrong by mistake rather than through bad intent, we should handle that permissively.

And then I wondered about the wisdom of creating a potentially large security library on a not quite nailed down API.

Ultimately, given that this is a security thing, I figured it's best to go with the proven, many-eyeballed solution that has had widespread acceptance.

Feel free to use the package we've provided. The bit of Go code you need for it is:

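    import (
    	"os/exec"
    	"strings"
    )
    
    // SanitiseHTML pipes html through cleanse.jar and returns the cleaned output.
    func SanitiseHTML(html string) (string, error) {
    	cleanse := exec.Command("java", "-jar", "/usr/sbin/cleanse.jar", "--permissive")
    
    	// Supply the input via Stdin so that Output starts the process and
    	// streams the data itself. Writing to a StdinPipe before the process
    	// has started can block once the pipe buffer fills.
    	cleanse.Stdin = strings.NewReader(html)
    
    	// Output starts the command, waits for it to exit, and returns its
    	// stdout, so no separate Start call is needed.
    	buff, err := cleanse.Output()
    	if err != nil {
    		return "", err
    	}
    
    	return string(buff), nil
    }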

Thanks to a fellow HN reader, we now have a head start on creating an HTML sanitizer.

https://github.com/microcosm-cc/bluemonday

Feel free to use it and help out.

Initial work was all done by Matt Jibson as part of his Google Reader Clone: https://github.com/mjibson/goread