It should be socially acceptable for Internet Archive to ignore robots.txt.

They have to respect it because we, collectively, say so. Obeying robots.txt is the minimum acceptable behavior for any robot, short of Asimov's laws.

But archiving is different. I've been running into "Site was not archived due to robots.txt" more and more frequently. Often these are articles from ~2011 and earlier which the author no doubt would have wanted to be archived.

Trouble is, robots.txt is also the only thing that people really bother to set up. Maybe there's a way right now to indicate "Sure, archive my site please, and ignore my robots.txt." But if there is, it's not really common knowledge, and it's kind of unreasonable to expect every single website on the internet to opt in to that.

On the flip side, it seems entirely reasonable that if someone really wants to opt out of archiving, they explicitly go and tell Internet Archive.

Circa 2016, Internet Archive is the only archive site that seems likely to persist to 2116. It's a shared time capsule, a ship that we all get a free ticket to board. If someone wants off, they can say so. But right now, large swaths of the internet simply aren't being archived due to rules that don't entirely seem to make sense. There are excellent reasons for robots.txt, but opting out of "Make this content available to my children's children's children's children" seems perhaps beyond the scope of the original spec.

Would you feel ok with the Archive ignoring your robots.txt, or would you feel annoyed? If annoyed, then this is a bad idea and should be rejected. But if nobody really cares, then here's a proposal:

Internet Archive stops checking /robots.txt, and checks for /archive.txt instead. If archive.txt exists, then it's parsed and obeyed as if it were a robots.txt file.

That way, every site can easily opt out. But everyone opts in by default. Sites can also exercise control over which portions they want archived, and how often.
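To make the archive.txt proposal concrete, here's a rough sketch in Python. It assumes archive.txt uses exactly the robots.txt syntax (as proposed), so the standard library's `urllib.robotparser` can read it. The function name `may_archive` and the `ia_archiver` user-agent string are made up for illustration, and the file contents are passed in as a string instead of being fetched from the site.

```python
from typing import Optional
from urllib.robotparser import RobotFileParser


def may_archive(archive_txt: Optional[str], url: str,
                agent: str = "ia_archiver") -> bool:
    """Decide whether `url` may be archived under the proposed scheme.

    archive_txt is the body of the site's /archive.txt, or None if the
    site has no such file. Hypothetical helper, not an Internet Archive API.
    """
    # No archive.txt means the site is opted in by default.
    if archive_txt is None:
        return True

    # Otherwise obey archive.txt as if it were a robots.txt file.
    parser = RobotFileParser()
    parser.parse(archive_txt.splitlines())
    return parser.can_fetch(agent, url)


# Example: a site that opts one directory out of archiving.
rules = "User-agent: *\nDisallow: /private/\n"
print(may_archive(None, "https://example.com/anything"))
print(may_archive(rules, "https://example.com/blog/post"))
print(may_archive(rules, "https://example.com/private/notes"))
```

The site's existing /robots.txt never enters the picture here, which is the whole point: only an explicit archive.txt can keep a page out.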
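It would be better if archive.org adhered to the robots.txt that was in effect on the requested date/year. For example, if example.com only started blocking robots in 2015, the content of example.com from 1999-2014 should still be shown.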
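A rough Python sketch of that date-based idea, assuming the archive keeps a dated history of each site's robots.txt. The `history` list, the function name `snapshot_visible`, and the `ia_archiver` user-agent string are hypothetical; the only real API used is the standard library's `urllib.robotparser`.

```python
from bisect import bisect_right
from datetime import date
from typing import List, Tuple
from urllib.robotparser import RobotFileParser


def snapshot_visible(history: List[Tuple[date, str]], snapshot_date: date,
                     url: str, agent: str = "ia_archiver") -> bool:
    """Apply the robots.txt that was in effect when the snapshot was taken.

    history holds (first_seen_date, robots_txt_body) pairs sorted by date.
    Hypothetical helper, not an Internet Archive API.
    """
    dates = [seen for seen, _ in history]
    # Index of the first version that appeared after the snapshot date.
    idx = bisect_right(dates, snapshot_date)
    if idx == 0:
        # No robots.txt existed yet, so nothing blocked the snapshot.
        return True

    _, robots_txt = history[idx - 1]
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# Example: a site that started blocking all robots in 2015.
history = [(date(2015, 1, 1), "User-agent: *\nDisallow: /\n")]
print(snapshot_visible(history, date(2010, 6, 1), "https://example.com/"))
print(snapshot_visible(history, date(2016, 6, 1), "https://example.com/"))
```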