HN Mirror


by henrybaxter 5581 days ago

You can get the best of all worlds, imo, by using lxml: it supports the selectors you want, it's Python (which I prefer), and in my experience it is more robust than BeautifulSoup.
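To illustrate the selector and robustness points, here is a minimal sketch that assumes lxml is installed and uses Python 3. The markup and the `story_links` helper are made up for the example. It uses XPath because that ships with lxml itself; in current lxml releases, CSS selectors via `tree.cssselect(...)` need the separate cssselect package.

```python
from lxml import html

# Deliberately sloppy, made-up markup: the <li> and <p> tags are never closed.
SLOPPY = """
<html><body>
<ul id="stories">
  <li class="story"><a href="/item?id=1">First story</a>
  <li class="story"><a href="/item?id=2">Second story</a>
</ul>
<p>Unclosed paragraph
</body></html>
"""


def story_links(markup):
    """Return (title, href) pairs for every link inside an li.story element."""
    # lxml's HTML parser repairs the missing end tags while building the tree.
    tree = html.fromstring(markup)
    return [
        (a.text_content().strip(), a.get("href"))
        for a in tree.xpath('//li[@class="story"]//a')
    ]


links = story_links(SLOPPY)
for title, href in links:
    print(title, href)
```

The parser recovers from the missing end tags without any special handling in the calling code, so the XPath query still finds both links.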

I spent more than a year writing hundreds of scrapers that ran for weeks at a time. In practice, BeautifulSoup did not work out as well as lxml. On extremely JavaScript-heavy pages we actually used PyV8.

Edit: more information at http://blog.ianbicking.org/2008/12/10/lxml-an-underappreciat... The comments there are useful too.


cdr 5580 days ago

Even better, use Scrapy, which is a whole framework designed specifically for scraping and, like lxml, is built on top of libxml2.


krakensden 5580 days ago

Scrapy is overkill for nearly everything. You'll probably have under a page of code using lxml and urllib.
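For a sense of scale, a throwaway scraper along those lines might look roughly like the sketch below. It assumes Python 3 (so `urllib.request` rather than the urllib of the time) and an installed lxml; `extract_links` and `scrape` are hypothetical helper names, not part of either library.

```python
from urllib.parse import urljoin
from urllib.request import urlopen

from lxml import html


def extract_links(markup, base_url):
    """Return an absolute URL for every <a href="..."> found in the markup."""
    tree = html.fromstring(markup)
    # The XPath returns the raw href strings; urljoin resolves relative ones.
    return [urljoin(base_url, str(href)) for href in tree.xpath("//a/@href")]


def scrape(url):
    """Fetch one page over HTTP and return the links on it."""
    with urlopen(url, timeout=30) as response:
        return extract_links(response.read(), url)
```

Calling `scrape("http://example.com/")` (a placeholder URL) would fetch the page and return its links as a list. Keeping the parsing in `extract_links` means it can be tried on a saved HTML string without touching the network.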


cdr 5580 days ago

I have under a page of code with Scrapy for simple projects, and more advanced features when I need them.

That's like saying "jQuery is overkill for just about everything, you should use plain JavaScript".


krakensden 5579 days ago

No, it's like saying "The full YUI suite is overkill for just about everything, you should just use the core or jQuery".

'scrapy startproject' creates a couple of nested directories, with maybe seven files. Are you writing a scraper that you're going to run regularly? Does it need to be super robust and maintainable? Or are you writing something that you'll run once, maybe twice?


cdr 5571 days ago

I seem to be missing why you think using a framework is a bad thing. With, say, Django or YUI, there are performance and abstraction issues that can bite you, but I don't see those mattering for so lightweight a framework and so tightly scoped a problem.
