by olegp 6429 days ago
Very few pages have well-formed mark-up. The few large scraping projects I've seen started out with a mark-up based approach and then switched to regular expressions.

What experiences has everyone else had?

3 comments

I've had a lot of success with BeautifulSoup. It turns terrible HTML into a usable DOM tree.
I am using Beautiful Soup; it works with malformed markup.
Uhm. I thought any valid HTML is also valid SGML? Are you sure you're not confusing it with XML markup?
I think he's saying that the HTML-parsing approach only works when the HTML is well-formed, and for most sites it isn't.
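
To make the distinction in this thread concrete, here is a rough sketch using only the Python standard library (not BeautifulSoup, and not anyone's actual scraper). It assumes a made-up malformed snippet, `BAD_HTML`, and hypothetical helper names. A strict XML parser rejects the snippet, while a lenient HTML parser and a regex fallback both still pull out the links.

```python
import re
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical malformed snippet: unclosed <p> and <b>, one unquoted attribute.
BAD_HTML = (
    '<html><body><p>First<p><b>Second '
    '<a href=/one>one</a><a href="/two">two</a></body>'
)


def strict_parse_ok(markup):
    """Return True if a strict XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


class LinkCollector(HTMLParser):
    """Lenient parser: records href values without requiring balanced tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def links_lenient(markup):
    """Extract links with the tolerant stdlib HTML parser."""
    collector = LinkCollector()
    collector.feed(markup)
    collector.close()
    return collector.links


def links_regex(markup):
    """Regex fallback: matches quoted or unquoted href values on <a> tags."""
    pattern = r'<a\s[^>]*?href\s*=\s*["\']?([^"\'\s>]+)'
    return re.findall(pattern, markup, flags=re.IGNORECASE)


if __name__ == "__main__":
    print("strict parser accepts it:", strict_parse_ok(BAD_HTML))
    print("lenient parser links:", links_lenient(BAD_HTML))
    print("regex links:", links_regex(BAD_HTML))
```

The regex here is deliberately narrow and will miss plenty of real-world cases (hrefs containing spaces inside quotes, links built by scripts, and so on), which is part of why a tolerant parser tends to be the more comfortable choice when one is available.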