| HN Mirror


	by isalmon 2559 days ago
	#3 respect robots.txt

3 comments

foob 2559 days ago

This is a polite thing to do, but I don't think there is any legal precedent for it being an actual requirement. Notably, both Apple and The Wayback Machine publicly disregard robots.txt files [1]. I would be very curious to read any court ruling that determined a robots.txt file needs to be respected.

[1] - https://intoli.com/blog/analyzing-one-million-robots-txt-fil...
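
For context, here is a minimal sketch of what "respecting robots.txt" amounts to mechanically, using Python's standard `urllib.robotparser`. The robots.txt content, the `example-bot` user agent, the `example.com` URLs, and the `is_allowed` helper are all made up for illustration; nothing here is taken from any real site or crawler.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch runs without network access.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# "example-bot" is an assumed crawler name; it falls under the "*" rules above.
allowed_public = is_allowed(ROBOTS_TXT, "example-bot", "https://example.com/index.html")
allowed_private = is_allowed(ROBOTS_TXT, "example-bot", "https://example.com/private/data.html")
```

The check is purely advisory: the crawler decides whether to call it and whether to act on the answer, and nothing on the server side enforces it.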

paxys 2559 days ago

It depends on the intention. You should respect robots.txt for search indexing, for example, but not necessarily for something like archiving or creating alternative page layouts (e.g. outline/reader view).

viraptor 2559 days ago

The Wayback Machine does look at robots.txt - https://help.archive.org/hc/en-us/articles/360004651732-Usin...

foob 2559 days ago

They look at them, but they don't follow them strictly [1]. They make judgement calls on what they should do rather than treating robots.txt files as a legal contract.

[1] - https://blog.archive.org/2017/04/17/robots-txt-meant-for-sea...

mrighele 2559 days ago

It's a pity that robots.txt doesn't let you specify what the crawler can do with the resources it's allowed to fetch. I think that if we had had such a feature (or something similar, like a "License" header) standardized early enough, a few issues regarding crawling and search engines would be moot, or at least easier to solve automatically.

qwerty456127 2559 days ago

True, but then all the commercial websites would use it to ban scraping.

siddboots 2559 days ago

If we're talking about being polite, then #4 respect the TOS. Especially requests per minute.
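
A rough sketch of how a scraper might stay under a requests-per-minute limit, assuming the limit is known in advance (for example from the site's TOS). The `RateLimiter` class and its parameters are hypothetical names for illustration, and spacing requests evenly is just one simple approach, not the only one.

```python
import time


class RateLimiter:
    """Space calls evenly so that at most max_per_minute happen per minute."""

    def __init__(self, max_per_minute: float, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 60.0 / max_per_minute
        self._clock = clock          # injectable so the limiter can be tested without waiting
        self._sleep = sleep
        self._next_allowed = None    # earliest time the next request may start

    def wait(self) -> float:
        """Block until the next request is allowed; return the seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._next_allowed is not None and now < self._next_allowed:
            delay = self._next_allowed - now
            self._sleep(delay)
            now = self._next_allowed
        # Reserve the next slot one full interval after this request starts.
        self._next_allowed = now + self.min_interval
        return delay
```

Usage would be to create one limiter per site, e.g. `limiter = RateLimiter(30)`, and call `limiter.wait()` immediately before each request to that site.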

corebit 2559 days ago

It’s the TOS itself that is legally tenuous, so your best bet is to completely ignore it. There’s no picking and choosing parts of it. Ignore all of it or implicitly accept all of it.
