| HN Mirror


by DenisM 6429 days ago

I use Python to write scripts of this nature (one script so far :)).

Python has a SAX-style SGML parser, and since HTML is an application of SGML, it can be used to parse HTML. Better than regexps any day.

Python's HTTP client library also supports cookies, so you can pretend to have a "session" with your target website.

EDIT: the libraries are urllib2, sgmllib, cookielib
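
A rough sketch of how those pieces fit together, written for Python 3, where urllib2 became urllib.request, cookielib became http.cookiejar, and sgmllib was removed (html.parser.HTMLParser stands in for it here). The names LinkCollector, make_session, extract_links and fetch_links are made up for illustration, and collecting links is just an assumed example task:

```python
# Python 3 sketch; urllib2, cookielib and sgmllib were Python 2 modules.
# Stand-ins used here: urllib.request, http.cookiejar, html.parser.
import urllib.request
from http.cookiejar import CookieJar
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """SAX-style handler: records the href of every <a> start tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def make_session():
    """Return (opener, jar); the opener stores and resends cookies."""
    jar = CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar)
    )
    return opener, jar


def extract_links(html):
    """Parse an HTML string and return the hrefs found in it."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


def fetch_links(opener, url):
    """Fetch url through the cookie-aware opener and return its links."""
    with opener.open(url) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        body = response.read().decode(charset, errors="replace")
    return extract_links(body)
```

Reusing the same opener for every request is what makes it behave like a session: cookies set by one response are sent back on the next call to fetch_links.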


olegp 6429 days ago

Very few pages have well-formed markup. The few large scraping projects I've seen started out with a markup-based approach and then switched to regular expressions.

What experiences has everyone else had?

Harkins 6429 days ago

I've had a lot of success with BeautifulSoup. It turns terrible HTML into a usable DOM tree.

joseakle 6429 days ago

I am using BeautifulSoup; it works with malformed markup.
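
To illustrate what "works with malformed markup" can look like, here is a minimal sketch. It does not use BeautifulSoup (a third-party package); as a stand-in it uses the standard library's html.parser, which is also lenient but only emits events and does not build a tree. The CellText class, the cell_texts helper and the sample table are invented for the example:

```python
from html.parser import HTMLParser


class CellText(HTMLParser):
    """Collects the text of each <td>, tolerating unclosed tags."""

    def __init__(self):
        super().__init__()
        self.cells = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            # A new <td> implicitly ends the previous, unclosed one.
            self.cells.append("")
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag in ("td", "tr", "table"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.cells[-1] += data


def cell_texts(html):
    """Return the stripped text of every table cell in html."""
    parser = CellText()
    parser.feed(html)
    parser.close()
    return [cell.strip() for cell in parser.cells]


# No </td>, </tr> or </b> anywhere: not well-formed, but still parseable.
messy = "<table><tr><td>Widget<td><b>9.99<tr><td>Gadget<td>19.50</table>"
print(cell_texts(messy))
```

The parser never raises on the missing end tags; the handler just treats each new cell as the end of the previous one.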

DenisM 6429 days ago

Uhm. I thought any valid HTML is also valid SGML? Are you sure you're not confusing it with XML markup?

ryanwaggoner 6429 days ago

I think he's saying that the HTML-parsing approach only works when the HTML is well-formed, and for most sites it isn't.

jreilly 6429 days ago

Thanks for the input. I am currently learning Rails, so I'm also wondering: are there any libraries that will make this significantly easier?