Hacker News
by henrybaxter 5533 days ago
You can get the best of all worlds, imo, by using lxml: it supports the selectors you want, it's Python (which I prefer), and in my experience it is more robust than BeautifulSoup.

I spent more than a year writing hundreds of scrapers that ran for weeks at a time. In practice, BeautifulSoup did not work out as well as lxml. On extremely JavaScript-heavy pages we actually used PyV8.

edit: more information at http://blog.ianbicking.org/2008/12/10/lxml-an-underappreciat... the comments are useful too.

1 comments

Even better, use Scrapy, a whole framework designed specifically for scraping. Like lxml, it is built on top of libxml2.
Scrapy is overkill for nearly everything. You'll probably have under a page of code using lxml and urllib.
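
For a sense of scale, here is a rough sketch of what "under a page" of lxml plus urllib might look like. It is only an illustration, not anyone's actual scraper from this thread: the function names `extract_links` and `scrape` are made up, it assumes Python 3's `urllib.request`, and it uses XPath rather than CSS selectors so that nothing beyond lxml itself is needed.

```python
from urllib.request import urlopen

import lxml.html


def extract_links(html, base_url):
    """Return (text, absolute_url) pairs for every <a href> in the HTML.

    `extract_links` is a hypothetical helper name, not part of lxml.
    """
    doc = lxml.html.fromstring(html)
    # Rewrite relative hrefs against the page URL so results are usable as-is.
    doc.make_links_absolute(base_url)
    return [
        (a.text_content().strip(), a.get("href"))
        for a in doc.xpath("//a[@href]")
    ]


def scrape(url):
    """Fetch a page with urllib and extract its links (hypothetical helper)."""
    with urlopen(url, timeout=30) as response:
        return extract_links(response.read(), url)
```

Calling `scrape()` on a page URL should give back a list of link text and absolute URL pairs. Retries, throttling, and error handling are left out on purpose, which is roughly where the framework argument starts.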
I have under a page of code with Scrapy for simple projects, and more advanced features when I need them.

That's like saying "jQuery is overkill for just about everything, you should use plain javascript".

No, it's like saying "The full YUI suite is overkill for just about everything, you should just use the core or jQuery".

'scrapy startproject' creates a couple of nested directories, with maybe seven files. Are you writing a scraper that you're going to run regularly? Does it need to be super robust and maintainable? Or are you writing something that you'll run once, maybe twice?

I seem to be missing why you think using a framework is a bad thing. With, say, Django or YUI there are performance and abstraction issues that can bite you, but I don't see those mattering for so lightweight a framework and so tightly scoped a problem.