

by betolink 3587 days ago

I worked on a research project to develop a web-scale "Google" for scientific data, and we found some very interesting things in robots.txt files, from "don't crawl us" to "crawl 1 page every other day" or, even better, "don't crawl unless you're Google".
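For readers who have not seen such rules, here is a rough sketch of what those three kinds of policy might look like and how a crawler could check them with Python's standard `urllib.robotparser`. The robots.txt contents, the `ExampleBot` user agent, and the example.org URL are all made up for illustration; they are not taken from any real site or from the paper. Note also that `Crawl-delay` is a widely seen but non-standard extension, so not every crawler honors it.

```python
from urllib.robotparser import RobotFileParser


def load_policy(text):
    """Parse robots.txt content from a string (no network access)."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


# Hypothetical policy 1: "don't crawl us".
NO_CRAWL = """\
User-agent: *
Disallow: /
"""

# Hypothetical policy 2: "crawl 1 page every other day"
# (172800 seconds = 2 days; Crawl-delay is a non-standard extension).
SLOW_CRAWL = """\
User-agent: *
Crawl-delay: 172800
Allow: /
"""

# Hypothetical policy 3: "don't crawl unless you're Google".
GOOGLE_ONLY = """\
User-agent: Googlebot
Allow: /

User-agent: *
Disallow: /
"""

URL = "https://example.org/data"  # placeholder URL
BOT = "ExampleBot"                # placeholder user agent for our own crawler

no_crawl = load_policy(NO_CRAWL)
slow_crawl = load_policy(SLOW_CRAWL)
google_only = load_policy(GOOGLE_ONLY)

print(no_crawl.can_fetch(BOT, URL))             # our bot is blocked everywhere
print(slow_crawl.crawl_delay(BOT))              # requested delay in seconds
print(google_only.can_fetch("Googlebot", URL))  # Googlebot is let in
print(google_only.can_fetch(BOT, URL))          # everyone else is not
```

A polite crawler would run checks like these before every fetch and sleep for the requested delay between requests to the same host.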

Another thing we noticed is that Google's crawler is kind of aggressive; I guess they are in a position to do it.

Our paper, in case someone is interested: Optimizing Apache Nutch for domain specific crawling at large scale (http://ieeexplore.ieee.org/document/7363976/?arnumber=736397...)

AznHisoka 3587 days ago

This is why I think Google's position as the #1 search engine will never go away. Many sites will tell your bot to go away if you're not Google. They don't care if you're building a search engine that will compete with Google.

greglindahl 3587 days ago

At blekko, we did not find this issue to be a significant one... almost everyone who banned our crawler was a crappy over-SEOed website.

AznHisoka 3587 days ago

https://www.linkedin.com/robots.txt

https://yelp.com/robots.txt

There goes all the LinkedIn + Yelp content from your index.

betolink 3587 days ago

What about https://www.facebook.com/robots.txt ?

...and medium-sized/small sites are even worse.

The irony is Facebook being a core part of all NSA surveillance programs while their terms of service include their "Automated Data Collection Terms": https://www.facebook.com/apps/site_scraping_tos_terms.php

greglindahl 3587 days ago

If you surf LinkedIn logged out, you'll see that there isn't very much information available anyway. And there's no money in people search.

Yelp was very responsive when blekko wrote to them; as you can see, ScoutJet has the same access as Googlebot.
