Hacker News
Ask HN: Open source focused crawler?
6 points by cookerware 4515 days ago
Is there an open source crawler or library that will recursively follow only the links under a certain XPath and ignore the rest?

I don't want to do an exhaustive crawl of every single link; I want something that will only follow links inside the main content area.
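
To make the requirement concrete, here is a rough sketch of the behaviour I mean, using only the Python standard library. It is an illustration under assumptions, not a finished crawler: the names `ContentLinkParser`, `focused_crawl`, `fetch` and `content_id` are made up for this example, the content area is assumed to be identified by an `id` attribute rather than a full XPath (the standard library has no XPath engine for real-world HTML), and the depth tracking assumes reasonably well-balanced tags.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin

# Elements that never have a closing tag; they must not affect depth tracking.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class ContentLinkParser(HTMLParser):
    """Collect href values of <a> tags nested inside the element with a given id."""

    def __init__(self, content_id):
        super().__init__()
        self.content_id = content_id
        self.depth = 0   # 0 means "currently outside the content element"
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self.depth == 0:
            # Enter the content area when its opening tag is seen.
            if attrs.get("id") == self.content_id and tag not in VOID_TAGS:
                self.depth = 1
            return
        if tag not in VOID_TAGS:
            self.depth += 1
        if tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        # Assumes balanced markup; unclosed <p> or <li> tags would skew the count.
        if self.depth > 0 and tag not in VOID_TAGS:
            self.depth -= 1


def focused_crawl(start_url, fetch, content_id="content", max_pages=50):
    """Breadth-first crawl that only follows links found inside the content element.

    `fetch` is a hypothetical hook: any callable mapping a URL to an HTML string
    (for real use it could wrap urllib.request.urlopen).
    Returns the URLs in the order they were visited.
    """
    seen = {start_url}
    queue = deque([start_url])
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        visited.append(url)
        parser = ContentLinkParser(content_id)
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop #fragments before de-duplicating.
            absolute, _fragment = urldefrag(urljoin(url, href))
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return visited


if __name__ == "__main__":
    # Tiny in-memory "site": the nav link to /about should never be followed.
    pages = {
        "http://example.com/": '<div id="nav"><a href="/about">About</a></div>'
                               '<div id="content"><a href="/post1">Post</a></div>',
        "http://example.com/post1": '<div id="content"><a href="/">Home</a></div>',
        "http://example.com/about": '<div id="content">About</div>',
    }
    print(focused_crawl("http://example.com/", pages.__getitem__))
```

Passing `fetch` in as a callable keeps the link-filtering logic separate from the networking, so the same function can be tried against an in-memory dict first and a real HTTP client later. A real crawl would also want politeness delays, robots.txt handling and a same-domain check, none of which are shown here.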

3 comments

I highly recommend Scrapy (http://www.scrapy.org).

From their site:

Scrapy is a fast high-level screen scraping and web crawling framework, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.

Check this out: http://commoncrawl.org/

It's not exactly what you are looking for, but it might help you.

Have you tried BeautifulSoup?