by betolink
3587 days ago
I worked on a research project to build a web-scale "Google" for scientific data, and we found some very interesting things in robots.txt files, from "don't crawl us" to "crawl 1 page every other day" or, even better, "don't crawl unless you're Google". Another thing we noticed is that Google's crawler is fairly aggressive; I guess they are in a position to do that. Our paper, in case anyone is interested: Optimizing Apache Nutch for domain specific crawling at large scale (http://ieeexplore.ieee.org/document/7363976/?arnumber=736397...)
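For anyone who hasn't read many of these files, here is a rough sketch (not from the paper) of what those three policies might look like and how Python's standard `urllib.robotparser` interprets them. The robots.txt bodies, the `ExampleResearchBot` name, the `example.org` URL and the `check` helper are all made up for illustration, and `Crawl-delay` is a non-standard directive that not every crawler honors.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt bodies, one per policy described in the comment.
POLICIES = {
    "no_crawl": "User-agent: *\nDisallow: /\n",
    # "1 page every other day" expressed as a delay of 2 days in seconds.
    "slow_crawl": "User-agent: *\nCrawl-delay: 172800\n",
    "google_only": (
        "User-agent: Googlebot\nAllow: /\n\n"
        "User-agent: *\nDisallow: /\n"
    ),
}


def check(policy_text, agent, url="https://example.org/data/set1"):
    """Return (allowed, crawl_delay) for one agent under one robots.txt body."""
    parser = RobotFileParser()
    parser.parse(policy_text.splitlines())
    return parser.can_fetch(agent, url), parser.crawl_delay(agent)


if __name__ == "__main__":
    for name, text in POLICIES.items():
        for agent in ("Googlebot", "ExampleResearchBot"):
            print(name, agent, check(text, agent))
```

Under the "google_only" body, only the Googlebot agent is allowed to fetch; a polite research crawler would have to skip that host entirely.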