Hacker News
by DenisM 6429 days ago
I use Python to write scripts of this nature (one script so far :)).

Python has a SAX-style SGML parser, and since HTML is an application of SGML, it can be used to parse HTML. Better than regexps any day.
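
A minimal sketch of the event-driven approach, under some assumptions: sgmllib no longer exists in Python 3, so this swaps in the standard library's html.parser, which works the same way (you subclass the parser and get a callback per tag). The LinkCollector class, the extract_links helper, and the SAMPLE page are made up for illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags as the parser streams through the page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased; attrs is a list of (name, value) pairs.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def extract_links(html_text):
    """Hypothetical helper: feed a whole page to the parser and return the links found."""
    parser = LinkCollector()
    parser.feed(html_text)
    parser.close()
    return parser.links


# Hypothetical sample page; note the unclosed <p> and <b> tags and the uppercase <A HREF>.
SAMPLE = '<html><body><p>See <a href="/a">one<p><b>and <A HREF="/b">two</a></body>'
```

Calling extract_links(SAMPLE) walks the page once and collects the href of each anchor, with no regular expressions involved.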

Python's HTTP client library also supports cookies, so you can pretend to have a "session" with your target website.
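
A rough sketch of the cookie "session" idea, assuming Python 3, where urllib2 and cookielib have become urllib.request and http.cookiejar. The make_session_opener helper is a made-up name, and the URLs in the comments are placeholders, not a real site.

```python
import http.cookiejar
import urllib.request


def make_session_opener():
    """Hypothetical helper: build an opener that stores and resends cookies."""
    jar = http.cookiejar.CookieJar()
    # HTTPCookieProcessor saves cookies from responses into the jar
    # and attaches matching cookies to later requests.
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


# Usage sketch (placeholder URLs, not executed here):
#   opener, jar = make_session_opener()
#   opener.open("http://example.com/login")    # server sets a session cookie
#   opener.open("http://example.com/private")  # cookie is sent back automatically
```

Every request made through the same opener shares the jar, which is what makes the target site see one continuous session.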

EDIT: the libraries are urllib2, sgmllib, cookielib


Very few pages have well-formed markup. The few large scraping projects I've seen started out with a markup-based approach and then switched to regular expressions.

What experiences has everyone else had?

I've had a lot of success with BeautifulSoup. It turns terrible HTML into a usable DOM tree.
I am using Beautiful Soup; it works with malformed markup.
Uhm. I thought any valid HTML is also valid SGML? Are you sure you're not confusing it with XML markup?
I think he's saying that the HTML-parsing approach only works when the HTML is well-formed, and for most sites it isn't.
Thanks for the input. I am currently learning Rails, so I'm also wondering if there are any libraries that will make this significantly easier.