Hacker News
by pknopf 2984 days ago
Is there a good HTML parser that you can pipe into/out of? Using Go, it would be easier to parse the HTML and tell the difference between <p>http://somewhere.com/</p> and <a href="http://somewhere.com/">test</a>.
1 comment

Personally I don't care about the difference between your two examples because a published URL is still a published URL regardless of whether it is a hyperlink or not.

Where I do care about the difference between an anchor and a paragraph block is with relative links and other URLs that lack a protocol prefix, because it is much harder to guess programmatically which strings are web URLs and which are, for example, sample file system paths in a technical document. In an ideal world the JavaScript, CSS and any web APIs (e.g. JSON responses) would be executed locally so the checker could follow any of the modern ways of abstracting away URLs (page redirection et al). But that's not to say there isn't a place for a less sophisticated parser (though I would say that, as I've also written a link checker similar to the one posted, hehehe).