by stummjr 3805 days ago
1) detect pages that had changed since the last crawl, to avoid recrawling pages that hadn't changed?

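You could use the deltafetch[1] middleware. It ignores requests to pages that already had items extracted in previous crawls. Note that it decides by whether a page produced items before, not by checking whether the page's content has changed.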
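For reference, enabling it should look roughly like the settings.py fragment below. This is a sketch from memory, not copied from the docs: the class path scrapylib.deltafetch.DeltaFetch, the priority value 100 and the DELTAFETCH_* setting names are assumptions to check against the version of scrapylib you install.

```python
# settings.py (sketch; verify names against your installed scrapylib)

# Register the middleware as a spider middleware.
# The priority 100 is an illustrative choice, not a required value.
SPIDER_MIDDLEWARES = {
    "scrapylib.deltafetch.DeltaFetch": 100,
}

# Turn the middleware on. It stays inactive unless this is set.
DELTAFETCH_ENABLED = True

# Assumed optional setting: directory where the "seen" state is kept
# between crawls. The directory name here is hypothetical.
DELTAFETCH_DIR = "deltafetch"
```

Since the state persists between runs, you would need to clear it whenever you actually want a full recrawl.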

2) detect pages that have changed their structure, breaking the spider that crawls them.

This is a tough one, since most spiders depend heavily on the HTML structure of the pages they crawl. You could use Spidermon [2] to monitor your spiders. It's available as an addon on the Scrapy Cloud platform [3], and there are plans to open source it in the near future. Also, automatically handling pages that change their structure is on the roadmap for Portia [4].
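If you want something simple in the meantime, one common symptom of a structure change is that extracted items start coming back with empty fields. The snippet below is a minimal sketch of that kind of check in plain Python. It is not Spidermon's API; the function names, the sample items and the 0.2 threshold are all made up for illustration.

```python
from typing import Iterable, Mapping, Sequence


def missing_field_rate(items: Iterable[Mapping], required_fields: Sequence[str]) -> float:
    """Return the fraction of items missing at least one required field."""
    items = list(items)
    if not items:
        # A crawl that yields nothing is treated as fully broken.
        return 1.0
    broken = sum(
        1
        for item in items
        # Any falsy value (None, "", 0, missing key) counts as missing.
        if any(not item.get(field) for field in required_fields)
    )
    return broken / len(items)


def looks_broken(items: Iterable[Mapping], required_fields: Sequence[str],
                 threshold: float = 0.2) -> bool:
    """Flag a crawl when too many items have empty required fields."""
    return missing_field_rate(items, required_fields) > threshold


# Hypothetical items from one crawl; two of the four lack a title or a price.
items = [
    {"title": "A", "price": "10"},
    {"title": "B", "price": ""},
    {"title": "", "price": None},
    {"title": "D", "price": "7"},
]

if looks_broken(items, ["title", "price"]):
    print("Extraction looks broken, check the spider's selectors")
```

You could run a check like this at the end of each crawl and alert when it fires. It only catches breakage that shows up as empty fields, so a selector that silently starts matching the wrong element would go unnoticed.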

[1] https://github.com/scrapinghub/scrapylib/blob/master/scrapyl...

[2] http://doc.scrapinghub.com/addons.html?highlight=monitoring#...

[3] http://scrapinghub.com/scrapy-cloud/

[4] http://scrapinghub.com/portia/