by runbycomment
4066 days ago
Agreed that it can be a headache, but I wanted to offer an alternative perspective. Personally, I feel that inclusion in Google constitutes public access to the data. As long as I'm not logged into an account on their system, I feel ethically justified in scraping their data. In other words, I do not feel compelled to respect robots.txt if that file does not also block Googlebot. Legally it may be another matter, but ethically I consider inclusion in Google an announcement that this information is public.
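To make that rule concrete, here is a minimal sketch of the check described above: is a URL open to Googlebot but closed to some other crawler? It uses Python's standard `urllib.robotparser`. The agent name `ExampleBot`, the helper names, and the sample robots.txt are all hypothetical, made up for illustration, and the sketch only reports what the file says; it does not settle whether ignoring it is acceptable.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_txt: str, agent: str, url: str) -> bool:
    """Return True if robots_txt permits `agent` to fetch `url`."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


def blocked_only_for_others(robots_txt: str, my_agent: str, url: str) -> bool:
    """True when Googlebot may fetch `url` but `my_agent` may not."""
    return allowed(robots_txt, "Googlebot", url) and not allowed(
        robots_txt, my_agent, url
    )


# Hypothetical robots.txt: Googlebot is welcome, every other crawler is not.
EXAMPLE = """\
User-agent: Googlebot
Allow: /

User-agent: *
Disallow: /
"""

if __name__ == "__main__":
    print(blocked_only_for_others(EXAMPLE, "ExampleBot", "https://example.com/page"))
```

A file that disallows every agent, Googlebot included, makes `blocked_only_for_others` return `False`, which is the case where the commenter would still honor robots.txt.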
In the early stages you probably won't be robots.txt'd because you're insignificant.
In later stages, you're hoping to not be robots.txt'd because you're providing a worthwhile service not just for users but for the site.
At neither stage should you force companies that don't want you indexing their content to go beyond basic means (robots.txt), because the more serious measures all cost them more money: tracking and blocking your IPs, C&Ds, DMCA requests to your provider asking that the entire site be taken down because there are thousands of infringing items, lawsuits seeking damages, court costs, or the costs of dealing with your circumvention of technical measures meant to keep you out of the site, finding friendly prosecutors, and so on.
You don't want to go down that more expensive road.