by stummjr
3805 days ago
Hey, I'm not sure I understood what you mean. Did you mean: 1) detect pages that have changed since the last crawl, to avoid recrawling pages that haven't changed?
2) detect pages that have changed their structure, breaking the spider that crawls them?
For 1), you could use the deltafetch [1] middleware. It ignores requests to pages that already had items extracted in previous crawls.
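To make the idea concrete, here is a minimal sketch of "skip pages that already yielded items". It is not deltafetch's actual code or API. The `SeenStore` class, its method names, and the JSON file storage are all hypothetical, chosen only to illustrate the technique.

```python
import hashlib
import json
from pathlib import Path


def fingerprint(url: str) -> str:
    """Return a stable hash identifying a URL (illustrative, not Scrapy's fingerprint)."""
    return hashlib.sha1(url.encode("utf-8")).hexdigest()


class SeenStore:
    """Hypothetical helper: remembers URLs that yielded items in earlier crawls."""

    def __init__(self, path):
        self.path = Path(path)
        # Load fingerprints saved by a previous crawl, if any.
        if self.path.exists():
            self.seen = set(json.loads(self.path.read_text()))
        else:
            self.seen = set()

    def should_fetch(self, url: str) -> bool:
        # Skip only pages that produced items before.
        return fingerprint(url) not in self.seen

    def mark_scraped(self, url: str) -> None:
        # Call this when a page actually yields at least one item.
        self.seen.add(fingerprint(url))

    def save(self) -> None:
        # Persist the fingerprints so the next crawl can reuse them.
        self.path.write_text(json.dumps(sorted(self.seen)))
```

Note that this approach skips pages that produced items before. It does not check whether their content has changed since then.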
2) detect pages that have changed their structure, breaking the spider that crawls them.
This is a tough one, since most spiders depend heavily on the HTML structure. You could use Spidermon [2] to monitor your spiders. It's available as an addon in the Scrapy Cloud platform [3], and there are plans to open source it in the near future. Also, automatically dealing with pages that change their structure is on the roadmap for Portia [4].
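As a rough illustration of what monitoring for structure changes can look like, here is a minimal sketch that flags a crawl when too many scraped items are missing required fields. This is not Spidermon's API. The field names, the function names, and the 20% threshold are assumptions made up for the example.

```python
# Hypothetical required fields; adjust to whatever your items should contain.
REQUIRED_FIELDS = ("title", "price")


def missing_field_rate(items, required=REQUIRED_FIELDS):
    """Return the fraction of items with at least one required field missing or empty."""
    if not items:
        # No items at all is treated as the worst case.
        return 1.0
    broken = sum(
        1 for item in items
        if any(item.get(field) in (None, "") for field in required)
    )
    return broken / len(items)


def looks_broken(items, threshold=0.2, required=REQUIRED_FIELDS):
    """Flag the crawl when the missing-field rate exceeds the (assumed) threshold."""
    return missing_field_rate(items, required) > threshold
```

A sudden jump in the missing-field rate between crawls is usually a sign that the selectors no longer match the page layout.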
[1] https://github.com/scrapinghub/scrapylib/blob/master/scrapyl...
[2] http://doc.scrapinghub.com/addons.html?highlight=monitoring#...
[3] http://scrapinghub.com/scrapy-cloud/
[4] http://scrapinghub.com/portia/