| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by kevincox 1255 days ago
	FWIW I don't think this really fits into robots.txt. That file is mostly aimed at crawlers. Not for services loading specific URLs due to (sometimes indirect) user requests. ...but as a place that could hold a rate limit recommendation it would be nice since it appears that the Git protocol doesn't really have the equivalent of a Cache-Control header.

3 comments

rakoo 1255 days ago

> Not for services loading specific URLs due to (sometimes indirect) user requests.

A crawler has a list of resources it periodically checks to see if it changed, and if it did, indexes it for user requests.

Contrary to this totally-not-a-crawler, with its own database of existing resources, that periodically checks if anything changed, and if it did, caches content and builds chescksums.

link

cmatthias 1255 days ago

I'm taking the OP at his word here, but he specifically claims that the proxy service making these requests will also make requests independent of a `go get` or other user-initiated action, sometimes to the tune of a dozen repos at once and 2500 requests per hour. That sounds like a crawler to me, and even if you want to argue the semantic meaning of the word "crawler," I strongly feel that robots.txt is the best available solution to inform the system what its rate limit should be.

link

kevincox 1255 days ago

When I mean crawler I mean something that discovers new pages. Refreshing the same URL isn't really crawling.

But yes, it may be the best available solution in this case, even if I would argue that it isn't really it's main purpose.

link

cmatthias 1255 days ago

After reading this and your response to a sibling comment I wholeheartedly disagree with you on both the specific definition of the word crawler and what the "main purpose" of robots.txt is, but glad we can agree that Google should be doing more to respect rate limits :)

link

ddevault 1255 days ago

What you're thinking about, in my opinion, is best referred to as a spider.

link

Arnavion 1255 days ago

As annoying as it is, there is precedent for this opinion with RSS aggregator websites like Feedly. They discover new feed URLs when their users add them, and then keep auto-refreshing them without further explicit user interaction. They don't respect robots.txt either.

link

kevincox 1255 days ago

I wouldn't expect or want an RSS aggregator to respect robots.txt for explicitly added feeds. That is effectively a human action asking for that feed to be monitored so robots.txt doesn't apply.

What would be good is respecting `Cache-Control`, which unfortunately many RSS clients don't, and just pick a schedule and poll on it.

link

Arnavion 1255 days ago

robots.txt was originally created to include such bots. That they think they don't need to respect it goes against the original intent.

Eg: https://www.robotstxt.org/faq/kinds.html >"What's New" monitoring

link

counttheforks 1255 days ago

I want my software to obey me, not someone else. If the software is discovering resources on its own, then obeying robots.txt is fair. But if the software is polling a resource I explicitly told it to, I would not expect it to make additional requests to fetch unrelated files such as a robots.txt

link

chillfox 1255 days ago

I can almost see both sides here... But ultimately when you are using someone else's resources, then not respecting their wishes (within reason) just makes you an asshole.

link