| HN Mirror



	by johanj 2973 days ago
	robots.txt is really only supposed to be used for blocking the Internet Archive's first snapshot, not for removing existing snapshots – and even this might not be the case in the future, as they try to preserve most snapshots. They made a few policy changes last year[1] to how they handle robots.txt files, among other things to handle cases where a domain is sold and a new robots.txt file would result in deleting old data. [1]: https://blog.archive.org/2017/04/17/robots-txt-meant-for-sea...

2 comments

jrochkind1 2973 days ago

Hmm, that may be what it's meant for, but I'm pretty sure it can currently be used to block things retroactively too. IA may still have the content in the archive, but won't let viewers view it.

As happened in this case: https://news.ycombinator.com/item?id=16919017

No? The article you linked says they've stopped paying attention to robots.txt for US government and military sites, but it looks like it still retroactively removes visibility for everything else.

I guess IA could change their practices. If Medium or sites like it start actively using robots.txt to try to retroactively remove things from visibility in the archive, perhaps IA will change their practices/policy. I would welcome it.

gnode 2973 days ago

Interesting. I wasn't aware that it no longer applies retroactively. Even so, medium.com's robots.txt still doesn't try to block new crawling by the Internet Archive's crawler:

https://medium.com/robots.txt

Or via WBM for posterity: https://web.archive.org/web/20180430183503/https://medium.co...

It seems unlikely to me that they would deliberately go to this length to prevent archival, yet not attempt to prevent it happening to begin with. Furthermore, as mentioned in your link, they still accept removal requests via email.
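
For anyone who wants to run the same kind of check on a robots.txt file, here is a minimal sketch using Python's standard `urllib.robotparser`. The `allows_crawl` helper and both sample files are made up for illustration and are not Medium's actual robots.txt. The `ia_archiver` user-agent token is an assumption about which name the Internet Archive has honored, so verify it before relying on the result.

```python
from urllib.robotparser import RobotFileParser


def allows_crawl(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical file that blocks only the (assumed) Internet Archive token.
SAMPLE_BLOCKING = """\
User-agent: ia_archiver
Disallow: /
"""

# Hypothetical file with a generic rule and no archive-specific entry.
SAMPLE_OPEN = """\
User-agent: *
Disallow: /private/
"""

if __name__ == "__main__":
    for name, text in (("blocking", SAMPLE_BLOCKING), ("open", SAMPLE_OPEN)):
        ok = allows_crawl(text, "ia_archiver", "https://example.com/some-post")
        print(f"{name}: ia_archiver allowed = {ok}")
```

To check a live site instead, `RobotFileParser.set_url()` followed by `read()` fetches the file. Either way, this only tells you what the file asks crawlers to do, not how the Wayback Machine treats snapshots it already holds.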