by toomuchtodo 4582 days ago
You just ignore the robots.txt file and crawl slowly from distributed virtual machines.

Not that you should do that. Robots.txt is a nicety, though: the client doesn't have to respect it, and the server doesn't have to allow your HTTP requests.
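
To make the "nicety" point concrete: the robots.txt check happens entirely on the client side, so a crawler only honors it if it chooses to run something like the sketch below. This is a rough illustration using Python's standard `urllib.robotparser`; the robots.txt contents, the `examplebot` user agent, and the example.com URLs are all made up for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents, parsed locally so no network access is needed.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 10
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# A polite client asks before fetching; nothing forces it to.
allowed = parser.can_fetch("examplebot", "https://example.com/index.html")
blocked = parser.can_fetch("examplebot", "https://example.com/private/data.html")

# Crawl-delay is likewise only a request from the site to the client.
delay = parser.crawl_delay("examplebot")
```

Nothing in this check is enforced by the server. If the site actually wants to keep a client out, it has to refuse or throttle the requests itself.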