Hacker News
by npunt 1047 days ago
I adore archive.org. I'm worried, though, that it is becoming something of a load-bearing element of civilization, given the importance of shared and accurate history. We need redundancy.

~~I'm also worried about the deletion of old pages on archive because new owners of a domain update the robots.txt file to disallow it, which I've heard wipes the entire archive.org history of that domain. I hope that gets addressed.~~

Edit: this is no longer the case

2 comments

Yeah wait, what? I hope that's incorrect. Updating robots.txt to disallow should only omit content from that point onward... It shouldn't be retroactive. What if the domain has a new owner, for example?
It was correct. And archive.org used this self-inflicted problem to justify disregarding robots.txt.

https://blog.archive.org/2017/04/17/robots-txt-meant-for-sea...

Oh thank you for the update, that's great news! Agree with their position 100%.
> Agree with their position 100%.

Including the part where they blamed anyone but themselves for hiding old snapshots when robots.txt changed?

No, I didn't see that in there. I just agree that robots.txt makes sense for web crawlers but not for archival purposes.
It isn't. I really wish that, instead of wiping DECADES of history, it only applied to content archived from the day of the domain's registration onward. I think this is slightly more reasonable, but I imagine they simply don't have access to such data.
Simply requiring domain owners to contact archive.org to remove old snapshots would be better than applying robots.txt retroactively.
The Internet Archive is not really known for deleting anything. In many different postings across the years, their founder and employees have mentioned items being taken "out" of the wayback machine, not "deleted" out of it. I don't think you have anything to worry about.