Hacker News new | ask | show | jobs
by harvie 313 days ago
Maybe we can just configure webservers to block anyone who requests robots.txt, regular browsers don't do it, but robots do to get list of urls to crawl (while ignoring rules). Just create simple PHP/CGI script that adds client IP addres to iptables once /robots.txt is accessed.
1 comments

One way to easily bypass is to let external services fetching robots.txt (archive.org, GitHub actions, etc...) to cache it and either expose through separate apis/webhook/manual download to the actual scrape server.

robots txt file size is usually small and would not alert external services.