Hacker News new | ask | show | jobs
by jorams 2101 days ago
How do you handle robots.txt? The previous incarnation of Always Online didn't care about robots.txt, while archive.org does.
2 comments

https://blog.cloudflare.com/cloudflares-always-online-and-th...

We tell archive.org about the URI, they crawl it. They handle robots.txt.

archive.org doesn't handle robots.txt in any meaningful way (see my comment above at https://news.ycombinator.com/item?id=24516875 ). If that's changed recently, I'd like to know more.
Note that archive.org stopped respecting robots.txt since 2017. [1]

In my experience, the site owner must email archive.org support to be excluded from its crawler and archiving.

[1]: https://boingboing.net/2017/04/22/internet-archive-to-ignore...