by stummjr 3805 days ago
Hey, not sure if I understood what you mean. Did you mean:

1) detect pages that had changed since the last crawl, to avoid recrawling pages that hadn't changed?
2) detect pages that have changed their structure, breaking the spider that crawls them?

3 comments

1) detect pages that had changed since the last crawl, to avoid recrawling pages that hadn't changed?

You could use the deltafetch [1] middleware. It skips requests to pages from which items were already extracted in previous crawls.
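To make the idea concrete, here is a minimal sketch in plain Python. It is not the deltafetch code, just the same principle under my own assumptions: `SeenPages`, `fingerprint`, `should_fetch` and `record` are hypothetical names, and the in-memory set stands in for whatever storage you would persist between runs.

```python
import hashlib


def fingerprint(url: str) -> str:
    """Stable key for a URL (hypothetical helper; a real crawler would
    normally canonicalize the URL before hashing it)."""
    return hashlib.sha1(url.encode("utf-8")).hexdigest()


class SeenPages:
    """Remembers which pages produced items in earlier crawls."""

    def __init__(self, seen=None):
        # Assumption: in practice this set is saved to disk or a database
        # between runs; here it only lives in memory.
        self.seen = set(seen or ())

    def should_fetch(self, url: str) -> bool:
        """True if the page has not yielded items in a previous crawl."""
        return fingerprint(url) not in self.seen

    def record(self, url: str, items: list) -> None:
        """Mark a page as seen, but only if it actually yielded items."""
        if items:
            self.seen.add(fingerprint(url))
```

Only pages that yielded items get recorded, so listing or index pages (which usually yield links rather than items) are still revisited on every crawl and can lead you to new item pages.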

2) detect pages that have changed their structure, breaking the spider that crawls them.

This is a tough one, since most spiders depend heavily on the HTML structure of the pages they crawl. You could use Spidermon [2] to monitor your spiders. It's available as an addon in the Scrapy Cloud platform [3], and there are plans to open source it in the near future. Also, automatically dealing with pages that change their structure is on the roadmap for Portia [4].
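If you want a rough do-it-yourself check in the meantime, one common signal is field coverage: when a layout change breaks a selector, the affected field suddenly comes back empty for most items. The sketch below is not Spidermon's API; `field_coverage`, `broken_fields` and the 0.9 threshold are assumptions of mine, and items are assumed to be plain dicts.

```python
from collections import Counter


def field_coverage(items, fields):
    """Fraction of items in which each field is present and non-empty."""
    counts = Counter()
    for item in items:
        for field in fields:
            if item.get(field) not in (None, "", []):
                counts[field] += 1
    total = len(items)
    return {f: (counts[f] / total if total else 0.0) for f in fields}


def broken_fields(items, fields, min_coverage=0.9):
    """Fields whose coverage fell below the threshold (assumed 0.9 here)."""
    coverage = field_coverage(items, fields)
    return sorted(f for f, c in coverage.items() if c < min_coverage)
```

You would run `broken_fields` over the items of a finished crawl and alert when it returns anything. Note that an empty crawl flags every field, which is usually what you want when a spider silently stops extracting.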

[1] https://github.com/scrapinghub/scrapylib/blob/master/scrapyl...

[2] http://doc.scrapinghub.com/addons.html?highlight=monitoring#...

[3] http://scrapinghub.com/scrapy-cloud/

[4] http://scrapinghub.com/portia/

> 1) detect pages that had changed since the last crawl, to avoid recrawling pages that hadn't changed?

Web clients usually rely on ETags for this (https://en.wikipedia.org/wiki/HTTP_ETag), as far as I can see. If the web app or server doesn't support them, you could compute your own hash of the response body and compare it yourself, instead of handling that condition at the HTTP level.
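A rough sketch of both approaches combined, using only the Python standard library. The function names and the shape of the `cache` dict are my own assumptions; the request sends `If-None-Match` when an ETag is known, and falls back to comparing a SHA-256 of the body otherwise.

```python
import hashlib
import urllib.error
import urllib.request


def content_hash(body: bytes) -> str:
    """SHA-256 hex digest of the raw response body."""
    return hashlib.sha256(body).hexdigest()


def fetch_if_changed(url, cache):
    """Return the body if the page changed, or None if it did not.

    `cache` is a plain dict (assumed shape):
    url -> {"etag": str or None, "hash": str}.
    """
    entry = cache.get(url, {})
    request = urllib.request.Request(url)
    if entry.get("etag"):
        # Ask the server to reply 304 if its ETag still matches ours.
        request.add_header("If-None-Match", entry["etag"])
    try:
        with urllib.request.urlopen(request) as response:
            body = response.read()
            etag = response.headers.get("ETag")
    except urllib.error.HTTPError as err:
        if err.code == 304:  # Not Modified: the server did the check for us
            return None
        raise
    # No usable ETag (or the server ignored it): compare hashes ourselves.
    digest = content_hash(body)
    unchanged = entry.get("hash") == digest
    cache[url] = {"etag": etag, "hash": digest}
    return None if unchanged else body
```

One caveat with the hash fallback: pages that embed timestamps, CSRF tokens or rotating ads will hash differently on every request, so in practice you may want to hash only the extracted content rather than the raw body.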

As someone who does a fair amount of scraping at his job, I'd like to hear what you have to say regarding both questions :)