Hacker News
by laarc 3839 days ago
It should be socially acceptable for Internet Archive to ignore robots.txt.

They have to respect it because we, collectively, say so. Obeying robots.txt is the minimum acceptable behavior for any robot, short of Asimov's laws.

But archiving is different. I've been running into "Site was not archived due to robots.txt" more and more frequently. Often these are articles from ~2011 and earlier which the author no doubt would have wanted to be archived.

Trouble is, robots.txt is also the only thing that people really bother to set up. Maybe there's a way right now to indicate "Sure, archive my site please, and ignore my robots.txt." But if there is, it's not really common knowledge, and it's kind of unreasonable to expect every single website on the internet to opt in to that.

On the flip side, it seems entirely reasonable that someone who really wants to opt out of archiving should explicitly go and tell Internet Archive. Circa 2016, Internet Archive is the only archive site that seems likely to persist to 2116. It's a shared time capsule, a ship that we all get a free ticket to board. If someone wants off, they can say so.

But right now, large swaths of the internet simply aren't being archived due to rules that don't entirely seem to make sense. There are excellent reasons for robots.txt, but opting out of "Make this content available to my children's children's children's children" seems perhaps beyond the scope of the original spec.

Would you feel ok with the Archive ignoring your robots.txt, or would you feel annoyed? If annoyed, then this is a bad idea and should be rejected.

But if nobody really cares, then here's a proposal: Internet Archive stops checking /robots.txt, and checks for /archive.txt instead. If archive.txt exists, then it's parsed and obeyed as if it were a robots.txt file.

That way, every site can easily opt out. But everyone opts in by default. Sites can also exercise control over which portions they want archived, and how often.
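The proposal above could be sketched roughly like this. It is only an illustration, not anything Internet Archive actually runs: `may_archive` and `ARCHIVE_AGENT` are made-up names, `ia_archiver` is assumed as the crawler's user-agent token, and Python's standard `urllib.robotparser` stands in for whatever parser the Archive would really use.

```python
from typing import Optional
from urllib.robotparser import RobotFileParser

ARCHIVE_AGENT = "ia_archiver"  # assumed user-agent token, for illustration


def may_archive(archive_txt: Optional[str], path: str) -> bool:
    """Decide whether `path` may be archived under the proposed scheme.

    `archive_txt` is the body of the site's /archive.txt, or None if the
    file does not exist. robots.txt is never consulted.
    """
    if archive_txt is None:
        # No /archive.txt: everyone opts in by default.
        return True
    parser = RobotFileParser()
    parser.parse(archive_txt.splitlines())  # same syntax as robots.txt
    return parser.can_fetch(ARCHIVE_AGENT, path)


# A site that wants everything archived except one directory.
example_archive_txt = "User-agent: *\nDisallow: /private/\n"
print(may_archive(None, "/blog/post"))                    # no archive.txt
print(may_archive(example_archive_txt, "/blog/post"))
print(may_archive(example_archive_txt, "/private/diary"))
```

The "how often" part could plausibly ride on a `Crawl-delay` line, which `RobotFileParser.crawl_delay()` can read, though that directive is an extension rather than part of the original robots.txt convention.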

6 comments

If example.com allowed indexing in 1999, a new owner of example.com can hide/delete the 1999-2015 content by changing the robots.txt in 2015.

It would be better if archive.org would adhere to the robots.txt of the requested date/year (show content of example.com from 1999-2014).

The fact that all popular URLs which fall out of registration are now picked up by squatter-spambots is also troubling. An Archive.org entry should not cease to exist when the registration lapses if the squatter-spambots decide to robots.txt everything. That would defeat its purpose completely.

I think the archive.org crawler should respect robots.txt as it looked at the time of the crawl. As a well-behaved robot, archive.org's crawler should fetch and respect robots.txt each time it crawls. However, archive.org should not retroactively delete old content when the current site puts up a robots.txt.
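To make the difference concrete, here is a rough sketch of the two policies: judging an old snapshot by the site's current robots.txt versus judging it by the robots.txt that was fetched during the same crawl. The `Snapshot` class, the function names, and the `ia_archiver` token are all assumptions for illustration; nothing here reflects how archive.org actually stores or filters crawls.

```python
from dataclasses import dataclass
from urllib.robotparser import RobotFileParser

AGENT = "ia_archiver"  # assumed user-agent token, for illustration


@dataclass
class Snapshot:
    year: int
    path: str
    robots_txt: str  # robots.txt body as fetched during the same crawl


def allowed(robots_txt: str, path: str) -> bool:
    """Return True if `robots_txt` lets AGENT fetch `path`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(AGENT, path)


def visible_retroactive(snap: Snapshot, current_robots_txt: str) -> bool:
    # Today's robots.txt decides the fate of every past crawl.
    return allowed(current_robots_txt, snap.path)


def visible_as_crawled(snap: Snapshot) -> bool:
    # Each crawl is judged only by the rules in force when it happened.
    return allowed(snap.robots_txt, snap.path)


open_rules = "User-agent: *\nDisallow:\n"     # 1999 owner: crawl anything
squatter_rules = "User-agent: *\nDisallow: /\n"  # later owner: block all

old_page = Snapshot(1999, "/essay.html", open_rules)
print(visible_retroactive(old_page, squatter_rules))  # hidden by new owner
print(visible_as_crawled(old_page))                   # still shown
```

Under the second policy a new domain owner can stop future crawls but cannot hide what the previous owner allowed.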

(To answer your other question, the robots.txt standard already allows giving different instructions to different crawlers.)
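For example, a robots.txt along these lines would let one crawler in while shutting out the rest. This is a sketch that assumes `ia_archiver` is the user-agent token the Archive's crawler answers to (worth verifying against their current documentation); `SomeOtherBot` is a made-up name. Python's standard `urllib.robotparser` is used only to check how the rules are read.

```python
from urllib.robotparser import RobotFileParser

# Per-crawler rules: an empty Disallow means "nothing is off limits".
robots_txt = """\
User-agent: ia_archiver
Disallow:

User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

archive_ok = parser.can_fetch("ia_archiver", "/2011/some-article")
others_ok = parser.can_fetch("SomeOtherBot", "/2011/some-article")
print(archive_ok, others_ok)
```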

The situation is a bit more nuanced than that. I had a website on shared hosting, and it was being indexed by archive.org. But years ago (maybe a decade?), their robot was doing something crazy that was overwhelming sites, and the server admin blocked the Internet Archive robots. Even worse, archive.org interpreted the block retroactively and deleted all the archives.

I would have loved for my site to be archived, but I also need my site to perform well. I'm savvy enough to use robots.txt but not to monitor my site's CPU, and I imagine a lot of people with WordPress or Squarespace sites don't even know about robots.txt. We need to find easy ways for people to control how their sites are archived. (And I don't know how any of this would fit with EU laws like the Right To Be Forgotten.)

The Archive doesn't delete anything; depending on the current robots.txt, they may not show pages from past crawls.

Update the robots.txt and you should be good to go.

Very well said and I strongly agree. What's worse is that highly legitimate sites that existed for years get domain-parked after shutting down and suddenly become inaccessible. Maybe for sites like that they can make the archives from before the switchover available, but it would probably be too costly staff-wise to look at each one case by case.

robots.txt already lets you specify per-robot behaviour. You can trivially opt out of crawling but opt in to archiving by explicitly allowing archive.org's bot and disallowing all other user agents.

Sorry, a little too drunk to scan your post, but I have considered this before.

I think Archive dot org, as they said on the Science Friday podcast, are not a legal archive or otherwise the final word, just trying to help out with archiving humanity. If I want to delete some old posts for whatever unsupported reason (or if a takeover of the domain brings a new robots.txt), then that's how it should go.

IMO.

Read your post tomorrow. I guarantee you will laugh. I've been there.