HN Mirror


by cmatthias 1255 days ago

Why do the Go team and/or Google think that it's acceptable to not respect robots.txt and instead DDoS git repositories by default, unless they get put on a list of "special case[s] to disable background refreshes"?

Why was the author of the post banned without notice from the Go issue tracker, removing what is apparently the only way to get on this list aside from emailing you directly?

Do you, personally, find any of this remotely acceptable?

3 comments

kevincox 1255 days ago

FWIW I don't think this really fits into robots.txt. That file is mostly aimed at crawlers. Not for services loading specific URLs due to (sometimes indirect) user requests.

...but as a place that could hold a rate limit recommendation it would be nice since it appears that the Git protocol doesn't really have the equivalent of a Cache-Control header.

rakoo 1255 days ago

> Not for services loading specific URLs due to (sometimes indirect) user requests.

A crawler has a list of resources it periodically checks to see if they changed, and if they did, indexes them for user requests.

Contrary to this totally-not-a-crawler, with its own database of existing resources, that periodically checks if anything changed, and if it did, caches content and builds checksums.

cmatthias 1255 days ago

I'm taking the OP at his word here, but he specifically claims that the proxy service making these requests will also make requests independent of a `go get` or other user-initiated action, sometimes to the tune of a dozen repos at once and 2500 requests per hour. That sounds like a crawler to me, and even if you want to argue the semantic meaning of the word "crawler," I strongly feel that robots.txt is the best available solution to inform the system what its rate limit should be.
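To make the "robots.txt as a rate limit hint" idea concrete, here is a minimal sketch using Python's standard `urllib.robotparser`. The agent name `ExampleModuleProxy`, the host `git.example.org`, and the file contents are hypothetical and do not come from the linked post. `Crawl-delay` and `Request-rate` are non-standard extensions that not every client honors, so this only shows what a cooperating client could read, not what Google's proxy actually does.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt a git host might serve; the agent name is made up.
ROBOTS_TXT = """\
User-agent: ExampleModuleProxy
Crawl-delay: 10
Request-rate: 1/5
Disallow: /private/

User-agent: *
Disallow:
"""


def load_policy(text: str) -> RobotFileParser:
    """Parse robots.txt text without making any network request."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


policy = load_policy(ROBOTS_TXT)
agent = "ExampleModuleProxy"

# Seconds the host asks this agent to wait between requests.
delay = policy.crawl_delay(agent)
# Named tuple (requests, seconds): at most `requests` per `seconds`.
rate = policy.request_rate(agent)
# Whether the agent may fetch a given (hypothetical) URL at all.
allowed = policy.can_fetch(agent, "https://git.example.org/repo.git/info/refs")
```

A client that wanted to cooperate would sleep for `delay` seconds between refreshes of the same host and skip any URL for which `can_fetch` returns `False`.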

kevincox 1255 days ago

When I say crawler I mean something that discovers new pages. Refreshing the same URL isn't really crawling.

But yes, it may be the best available solution in this case, even if I would argue that it isn't really its main purpose.

cmatthias 1255 days ago

After reading this and your response to a sibling comment, I wholeheartedly disagree with you on both the specific definition of the word crawler and what the "main purpose" of robots.txt is, but I'm glad we can agree that Google should be doing more to respect rate limits :)

ddevault 1255 days ago

What you're thinking about, in my opinion, is best referred to as a spider.

Arnavion 1255 days ago

As annoying as it is, there is precedent for this opinion with RSS aggregator websites like Feedly. They discover new feed URLs when their users add them, and then keep auto-refreshing them without further explicit user interaction. They don't respect robots.txt either.

kevincox 1255 days ago

I wouldn't expect or want an RSS aggregator to respect robots.txt for explicitly added feeds. That is effectively a human action asking for that feed to be monitored so robots.txt doesn't apply.

What would be good is respecting `Cache-Control`, which unfortunately many RSS clients don't do; they just pick a schedule and poll on it.
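As a rough illustration of what "respecting `Cache-Control`" could mean for a polling client, the sketch below derives the next poll interval from the `max-age` directive instead of using a fixed schedule. The fallback and minimum intervals are assumed values chosen for the example, and the function deliberately ignores every directive other than `max-age`.

```python
from typing import Optional

DEFAULT_INTERVAL = 3600  # assumed fallback when no max-age is given, in seconds
MIN_INTERVAL = 300       # assumed floor so a tiny max-age cannot cause hammering


def poll_interval(cache_control: Optional[str]) -> int:
    """Return how many seconds to wait before polling the feed again."""
    if cache_control:
        for directive in cache_control.split(","):
            name, _, value = directive.strip().partition("=")
            value = value.strip().strip('"')
            if name.strip().lower() == "max-age" and value.isdigit():
                # Honor the server's freshness lifetime, but never poll
                # more often than the floor.
                return max(int(value), MIN_INTERVAL)
    return DEFAULT_INTERVAL
```

For example, a response carrying `Cache-Control: public, max-age=7200` would be polled again after two hours, while a response with no usable `max-age` falls back to the default.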

Arnavion 1255 days ago

robots.txt was originally created to include such bots. That they think they don't need to respect it goes against the original intent.

E.g. https://www.robotstxt.org/faq/kinds.html

> "What's New" monitoring

counttheforks 1255 days ago

I want my software to obey me, not someone else. If the software is discovering resources on its own, then obeying robots.txt is fair. But if the software is polling a resource I explicitly told it to, I would not expect it to make additional requests to fetch unrelated files such as robots.txt.

chillfox 1255 days ago

I can almost see both sides here... But ultimately when you are using someone else's resources, then not respecting their wishes (within reason) just makes you an asshole.

michaelcampbell 1255 days ago

Going up the stack a bit, this feels to me like the same sort of "we know better" mentality that said no one really needs generics.

jslql 1255 days ago

Why should a git client respect an HTTP standard such as robots.txt?

yamtaddle 1255 days ago

Google began pushing for it to become an Internet standard in 2019, explicitly so that it would apply to any URI-driven Internet system and not just the Web, and it was published as a standards-track RFC (RFC 9309) in 2022.

https://developers.google.com/search/blog/2019/07/rep-id

cmatthias 1255 days ago

This is true but irrelevant to the parent's question -- in the article, it's made clear that Google's requests are happening over HTTP, which is the most obvious reason why robots.txt should be respected.

yamtaddle 1255 days ago

It's relevant because it attacks the premise of their objection.

cmatthias 1255 days ago

Read the OP; it's obvious based on the references to robots.txt, the User-Agent header, returning a 429 response, etc., that most (all?) of Google's requests are doing git clones over HTTP(S).
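For readers wondering what a well-behaved HTTP client could do with the 429 responses mentioned above, here is a minimal sketch that turns a 429 status and its optional `Retry-After` header into a wait time. The function name and the 60-second fallback are assumptions made for the example; nothing here describes how the Go proxy actually reacts to a 429.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def retry_delay(status: int, retry_after: Optional[str],
                now: datetime, fallback: float = 60.0) -> float:
    """Return seconds to wait before retrying; 0.0 if not rate limited."""
    if status != 429:
        return 0.0
    if retry_after is None:
        return fallback
    value = retry_after.strip()
    # Retry-After is either a number of seconds or an HTTP-date.
    if value.isdigit():
        return float(value)
    try:
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return fallback  # unparseable header: use the assumed fallback
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    return max((when - now).total_seconds(), 0.0)
```

Passing `now` in explicitly keeps the function easy to test; a real client would call it with `datetime.now(timezone.utc)` and sleep for the returned number of seconds before contacting the same host again.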

trulyrandom 1255 days ago

Because it uses HTTP.