by SyneRyder 3839 days ago
The situation is a bit more nuanced than that. I had a website on shared hosting, and it was being indexed by archive.org. But years ago (maybe a decade?), their robot was doing something crazy that was overwhelming sites, and the server admin blocked the Internet Archive robots. Even worse, archive.org interpreted the block retroactively and deleted all the archives.

I would have loved for my site to be archived, but I also need my site to perform well. I'm savvy enough to use robots.txt but not to monitor my site's CPU usage, and I imagine a lot of people with WordPress or Squarespace sites don't even know robots.txt exists. We need to find easy ways for people to control how their sites are archived. (And I don't know how any of this would fit with EU laws like the Right To Be Forgotten.)

1 comment

The Archive doesn't delete anything; depending on the site's current robots.txt, it may just stop showing pages from past crawls.

Update the robots.txt and you should be good to go.
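
If you want to sanity-check a robots.txt before uploading it, here is a rough sketch using Python's standard urllib.robotparser. The rules, the example.com URLs and the "ia_archiver" user-agent token are my assumptions for illustration (check the Archive's own documentation for the token it currently honors); nothing here comes from the parent's actual site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: the Archive's crawler may fetch everything except
# /admin/, while every other robot is blocked entirely.
# "ia_archiver" is an assumed user-agent token, not confirmed in this thread.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /admin/

User-agent: *
Disallow: /
"""


def is_allowed(robots_txt: str, agent: str, url: str) -> bool:
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    for agent, url in [
        ("ia_archiver", "https://example.com/index.html"),
        ("ia_archiver", "https://example.com/admin/login"),
        ("SomeOtherBot", "https://example.com/index.html"),
    ]:
        print(agent, url, is_allowed(ROBOTS_TXT, agent, url))
```

This only tells you how a standards-following parser reads the file. Whether and when previously hidden snapshots reappear is up to the Archive.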