by mrkramer
817 days ago
Internet Archive's crawler does not respect robots.txt because they want to archive everything, not just parts of the Web. But if you actively ignore robots.txt, your crawler will get a bad reputation and you will have an army of webmasters trying to block it by any means. You can see crawl requests in your server logs; that's how you know whether a crawler is respecting it or not. IMO, the best solution would be to license your content so crawlers pay a fee for crawling and using it.
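For anyone who wants to run the server-log check, here is a rough sketch in Python. It assumes your access log is in the common "combined" format and uses the standard library's urllib.robotparser. The robots.txt rules, the sample log lines, the "examplebot" user-agent token, and the find_violations helper are all made up for illustration; swap in your own. Keep in mind the User-Agent header can be spoofed, so this only tells you what a client claims to be.

```python
import re
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules for the site being checked.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

# Combined Log Format (assumed):
# ip - - [time] "METHOD path HTTP/x" status size "referer" "user-agent"
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'\d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"$'
)


def find_violations(log_lines, robots_txt, agent_token):
    """Return paths requested by agent_token that robots_txt disallows."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    violations = []
    for line in log_lines:
        match = LOG_LINE.match(line.strip())
        if match is None:
            continue  # skip lines that are not in the assumed format
        if agent_token.lower() not in match["agent"].lower():
            continue  # request came from a different client
        if not parser.can_fetch(agent_token, match["path"]):
            violations.append(match["path"])
    return violations


# Made-up sample log entries (documentation IP ranges, fictional bot name).
SAMPLE_LOG = [
    '203.0.113.5 - - [10/Oct/2023:13:55:36 +0000] "GET /private/a.html HTTP/1.1" 200 512 "-" "examplebot/1.0"',
    '203.0.113.5 - - [10/Oct/2023:13:55:40 +0000] "GET /public/b.html HTTP/1.1" 200 512 "-" "examplebot/1.0"',
    '198.51.100.7 - - [10/Oct/2023:13:56:01 +0000] "GET /private/c.html HTTP/1.1" 200 512 "-" "Mozilla/5.0"',
]

print(find_violations(SAMPLE_LOG, ROBOTS_TXT, "examplebot"))
# prints ['/private/a.html']
```

On a real server you would read the lines from your access log file and load the robots.txt you actually serve, then pass the crawler's user-agent token as agent_token.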
Does IA itself block crawlers? It doesn't look like it according to their robots.txt, which even goes so far as to say "Please crawl our files."
What would stop an actor from maliciously complying with a robots.txt file by just going to the Internet Archive instead?