by fencepost
4066 days ago
Ignoring or bypassing robots.txt is probably a bad idea, unless you plan to never even look for it and to plead incompetence if someone comes after you. In the early stages you probably won't be robots.txt'd because you're insignificant. In later stages, you're hoping not to be robots.txt'd because you're providing a worthwhile service, not just for users but for the site. At neither stage should you force companies that don't want you indexing their content to go beyond basic means (robots.txt), because the more serious measures all cost them more money: tracking and blocking your IPs, C&Ds, DMCA requests asking your provider to take down the entire site because there are thousands of infringing items, lawsuits seeking (damages | court costs | costs for dealing with your circumvention of technical measures to keep you out of the site), finding friendly prosecutors, etc. You don't want to go down that more expensive road.
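For what it's worth, honoring robots.txt takes very little code. Here's a minimal sketch using Python's standard `urllib.robotparser`; the bot name `examplebot`, the `example.com` URLs, the sample rules, and the helper `allowed_by_robots` are all made up for illustration, and a real crawler would fetch the site's actual file (e.g. with `RobotFileParser.set_url()` and `read()`) instead of parsing a string.

```python
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical rules: one bot is shut out entirely, everyone else
# is only kept out of /private/.
EXAMPLE_ROBOTS = """\
User-agent: examplebot
Disallow: /

User-agent: *
Disallow: /private/
"""

if __name__ == "__main__":
    for agent, url in [
        ("examplebot", "https://example.com/index.html"),
        ("otherbot", "https://example.com/index.html"),
        ("otherbot", "https://example.com/private/data"),
    ]:
        verdict = "fetch" if allowed_by_robots(EXAMPLE_ROBOTS, agent, url) else "skip"
        print(f"{agent} {url}: {verdict}")
```

Run the check before every request and skip anything it rejects; that is the "basic means" a site is counting on you to respect.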
This operates in what I consider a legal grey area. Not making it obvious that you're scraping, only scraping public information, transforming the results, and proxying your requests all contribute to lowering your legal profile (which is my only concern, as I feel I am acting within my own ethical limits).