by betolink 3587 days ago
I worked on a research project to develop a web-scale "Google" for scientific data, and we found some very interesting things in robots.txt files, from "don't crawl us" to "crawl 1 page every other day" or, even better, "don't crawl unless you're Google".
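For anyone curious what those three policies look like from the crawler's side, here is a minimal sketch using Python's standard urllib.robotparser. The robots.txt content, the example.org URL, and the bot names "SlowBot" and "MyResearchBot" are made up for illustration; they are not taken from any real site or from our paper. Crawl-delay is a non-standard directive, so not every crawler honors it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt combining the three policies mentioned above.
# "SlowBot" and "MyResearchBot" are invented names, not real crawlers.
ROBOTS_TXT = """\
User-agent: Googlebot
Allow: /

User-agent: SlowBot
Crawl-delay: 172800
Allow: /

User-agent: *
Disallow: /
"""


def load_rules(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text without touching the network."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def policy_for(parser: RobotFileParser, agent: str, url: str):
    """Return (allowed, crawl_delay_in_seconds_or_None) for one user agent."""
    return parser.can_fetch(agent, url), parser.crawl_delay(agent)


rules = load_rules(ROBOTS_TXT)
url = "https://example.org/datasets/1"  # placeholder URL
for agent in ("Googlebot", "SlowBot", "MyResearchBot"):
    allowed, delay = policy_for(rules, agent, url)
    print(f"{agent}: allowed={allowed}, crawl_delay={delay}")
```

A delay of 172800 seconds is two days, i.e. "1 page every other day". Any agent that does not match a named group falls through to the `*` group, which is how "don't crawl unless you're Google" ends up being expressed.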

Another thing we noticed is that Google's crawler is kind of aggressive; I guess they are in a position to do it.

Our paper, in case someone is interested: Optimizing Apache Nutch for domain specific crawling at large scale (http://ieeexplore.ieee.org/document/7363976/?arnumber=736397...)


This is why I think Google's position as the #1 search engine will never go away. Many sites will tell your bot to go away if you're not Google. They don't care if you're building a search engine that will compete with Google.
At blekko, we did not find this issue to be a significant one... almost everyone who banned our crawler was a crappy over-SEOed website.
https://www.linkedin.com/robots.txt

https://yelp.com/robots.txt

There goes all the LinkedIn + Yelp content from your index.

What about https://www.facebook.com/robots.txt ?

...and medium-sized and small sites are even worse.

The irony: Facebook is a core part of all NSA surveillance programs, yet their terms of service include "Automated Data Collection Terms": https://www.facebook.com/apps/site_scraping_tos_terms.php

If you surf LinkedIn logged out, you'll see that there isn't very much information available anyway. And there's no money in people search.

Yelp was very responsive when blekko wrote them; as you can see ScoutJet has the same access as googlebot.