Hacker News
by isalmon 2559 days ago
#3 respect robots.txt
3 comments

This is a polite thing to do, but I don't think there is any legal precedent for it being an actual requirement. Notably, both Apple and The Wayback Machine publicly disregard robots.txt files [1]. I would be very curious to read any court ruling that determined a robots.txt file needs to be respected.

[1] - https://intoli.com/blog/analyzing-one-million-robots-txt-fil...

It depends on the intention. You should respect robots.txt for search indexing, for example, but not necessarily for something like archiving or creating alternative page layouts (e.g., outline/reader view).
The Wayback Machine does look at robots.txt: https://help.archive.org/hc/en-us/articles/360004651732-Usin...
They look at them, but they don't follow them strictly [1]. They make judgement calls on what they should do rather than treating robots.txt files as a legal contract.

[1] - https://blog.archive.org/2017/04/17/robots-txt-meant-for-sea...
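For anyone who does want to honor robots.txt when crawling, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt content, the `example-bot` user agent, and the example.com URLs are all made up for illustration; a real crawler would fetch the file from the target site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch needs no network access.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# "example-bot" is an assumed user-agent name, not a real crawler.
allowed = parser.can_fetch("example-bot", "https://example.com/index.html")
blocked = parser.can_fetch("example-bot", "https://example.com/private/data.html")
delay = parser.crawl_delay("example-bot")  # seconds, or None if not specified
```

Whether to act on the result is the judgement call discussed above; the parser only reports what the file says.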

It's a pity that robots.txt doesn't let you specify what the crawler can do with the resources it's allowed to fetch. I think that if we had such a feature (or something similar, like a "License" header) standardized early enough, a few issues regarding crawling and search engines would be moot, or at least easier to solve automatically.
True, but then all the commercial websites would use it to ban scraping.
If we're talking about being polite, then #4: respect the TOS, especially any requests-per-minute limits.
It’s the TOS itself that is legally tenuous, so your best bet is to completely ignore it. There’s no picking and choosing parts of it. Ignore all of it or implicitly accept all of it.
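Legal questions aside, the requests-per-minute point is easy to honor in code. Below is a rough sketch of a throttle that spaces calls evenly; the `Throttle` class and its parameters are hypothetical names for illustration, and the limit of 30 per minute is an assumed figure, not one taken from any particular site.

```python
import time


class Throttle:
    """Spaces out calls so roughly `per_minute` of them happen per minute."""

    def __init__(self, per_minute, clock=time.monotonic, sleep=time.sleep):
        self.interval = 60.0 / per_minute  # minimum seconds between calls
        self._clock = clock                # injectable for testing
        self._sleep = sleep
        self._next_allowed = None          # earliest time of the next call

    def wait(self):
        """Block until the next request is allowed; return seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._next_allowed is not None and now < self._next_allowed:
            delay = self._next_allowed - now
            self._sleep(delay)
            now = self._next_allowed
        self._next_allowed = now + self.interval
        return delay


# Usage with a fake clock and a recording sleep, so nothing actually blocks.
sleeps = []
throttle = Throttle(30, clock=lambda: 100.0, sleep=sleeps.append)
first = throttle.wait()   # first call goes through immediately
second = throttle.wait()  # second call must wait one full interval
```

In a real scraper you would call `throttle.wait()` with the default clock and sleep before each request.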