by nlogn 5609 days ago
I don't think they should merely blacklist Google. They should respect robots.txt.

edit: I should have clarified. I know that the Bing crawler likely respects robots.txt, but if they are using clickstream info to build their index, it seems right that they should respect robots.txt there as well, no?

3 comments

I'm pretty sure the Bing Crawler does respect robots.txt. The data Bing collected didn't come from spidering Google.
You could make a strong argument that collecting clickstream and other browser session data via a toolbar is not a form of web robot (crawler, spider, etc.), and thus robots.txt does not apply.
I agree with your comment that toolbars should respect robots.txt: even if a human is doing the crawling, it is still an automated system that is indexing information from that site. I would not want a toolbar sending data back to Bing based on my queries on a company intranet or a site that would not normally be indexed. Personal data entered into what the toolbar took to be a query field could be sent onward as well, even if the site's robots.txt restricted it. I think they should respect robots.txt in this case even if they are only monitoring user behavior.
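
For what it's worth, here is a minimal sketch of what such a check could look like on the toolbar side, using Python's standard `urllib.robotparser`. The `may_report` helper, the `ExampleToolbar` user-agent string, and the sample robots.txt are all made up for illustration; nothing here reflects how the Bing toolbar actually works.

```python
from urllib.robotparser import RobotFileParser


def may_report(url, robots_txt, user_agent="ExampleToolbar"):
    """Return True if robots.txt would let this user agent fetch `url`.

    Hypothetical helper: a toolbar could apply the same rule to
    clickstream URLs before sending them anywhere.
    """
    parser = RobotFileParser()
    # Parse the rules from text so the sketch needs no network access.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Sample rules (an assumption): the site excludes its search pages.
robots_txt = """\
User-agent: *
Disallow: /search
"""

# URLs the user visited in the browser.
clicks = [
    "https://example.com/search?q=private+query",
    "https://example.com/about",
]

# Only URLs that a well-behaved crawler could fetch are kept.
reportable = [u for u in clicks if may_report(u, robots_txt)]
print(reportable)
```

A real toolbar would also have to fetch and cache each site's robots.txt, which is not possible for an intranet host it cannot reach from outside, so this only covers part of the concern.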