
	by jorams 2101 days ago
	How do you handle robots.txt? The previous incarnation of Always Online didn't care about robots.txt, while archive.org does.

2 comments

jgrahamc 2101 days ago

https://blog.cloudflare.com/cloudflares-always-online-and-th...

We tell archive.org about the URI and they crawl it. They handle robots.txt.

AnonHP 2100 days ago

archive.org doesn't handle robots.txt in any meaningful way (see my comment above at https://news.ycombinator.com/item?id=24516875 ). If that's changed recently, I'd like to know more.

AnonHP 2100 days ago

Note that archive.org has not respected robots.txt since 2017. [1]

In my experience, the site owner must email archive.org support to be excluded from its crawler and archiving.

[1]: https://boingboing.net/2017/04/22/internet-archive-to-ignore...
