| HN Mirror



	by yazzku 593 days ago
	tl;dr the crawlers do not respect robots.txt or the user agent anymore, but you can drop big bucks on the enterprise HA offering to stop them through other means.

3 comments

dartos 593 days ago

Should we webmasters just start blocking user agents wholesale?

I mean except known good actors.

I guess known actors would need a verifiable signature


rty32 593 days ago

Not viable. They are going to use user agents that look like those coming from completely normal human users.

"Verifiable signature"? That's a dangerous road to go down, and Google actually wanted to do it (the Web Environment Integrity API). Nobody supported them and they backed out.


jsheard 593 days ago

Search engine crawlers do have verifiable signatures, if a client claims to be Googlebot or Bingbot you don't have to take their word for it.

https://developers.google.com/search/docs/crawling-indexing/...

https://www.bing.com/webmasters/help/how-to-verify-bingbot-3...
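
The check described above is usually done as a forward-confirmed reverse DNS lookup: resolve the client IP to a hostname, confirm the hostname falls under a domain the search engine publishes, then resolve that hostname back and confirm it returns the same IP. A minimal sketch, assuming the hostname suffixes below are still current (check the linked pages before relying on them); `is_verified_crawler` and the injectable `reverse`/`forward` resolvers are hypothetical names used here for illustration:

```python
import socket

# Hostname suffixes as commonly documented by the two search engines.
# Assumption: verify against the vendor pages linked above before use.
CRAWLER_SUFFIXES = {
    "googlebot": (".googlebot.com", ".google.com"),
    "bingbot": (".search.msn.com",),
}


def _reverse(ip):
    # Reverse DNS (PTR) lookup: IP -> primary hostname.
    return socket.gethostbyaddr(ip)[0]


def _forward(host):
    # Forward DNS lookup: hostname -> list of IPv4 addresses.
    return socket.gethostbyname_ex(host)[2]


def is_verified_crawler(ip, claimed, reverse=_reverse, forward=_forward):
    """Return True only if `ip` passes forward-confirmed reverse DNS
    for the crawler named in `claimed` (e.g. "googlebot")."""
    suffixes = CRAWLER_SUFFIXES.get(claimed.lower())
    if not suffixes:
        return False
    try:
        host = reverse(ip).rstrip(".").lower()
        # The leading dot in each suffix rejects lookalikes such as
        # "evilgooglebot.com".
        if not host.endswith(suffixes):
            return False
        # The forward lookup must map back to the connecting IP.
        return ip in forward(host)
    except OSError:
        # socket.herror and socket.gaierror both subclass OSError.
        return False
```

The resolvers are passed in as parameters so the logic can be exercised without network access. Note that `gethostbyname_ex` only returns IPv4 addresses, so an IPv6 deployment would need `socket.getaddrinfo` instead.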


yazzku 593 days ago

But the converse is not true? There is no guarantee the crawler is not amassing data for model training, or that a crawler (AI or otherwise) does not disguise itself as a normal user?


jsheard 593 days ago

Yeah, but traffic appearing to come from normal users can be throttled and/or CAPTCHA'ed while still allowing Google and Bing to crawl to their hearts' content, so your SEO isn't affected.


SoftTalker 593 days ago

I would think rate-limiting would be good. Crawlers are not patient enough to operate at the speed of a real human user.
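
One common way to do this kind of rate limiting is a per-client token bucket: each client gets a small burst allowance that refills at roughly human browsing speed. A minimal sketch, assuming clients are keyed by something like IP address; the `TokenBucket` class and its `rate`, `burst`, and `clock` parameters are hypothetical names, not taken from any particular server or library:

```python
import time


class TokenBucket:
    """Per-client token bucket: `rate` tokens refill per second,
    up to a maximum of `burst` tokens."""

    def __init__(self, rate, burst, clock=time.monotonic):
        self.rate = rate
        self.burst = burst
        self.clock = clock  # injectable so tests can control time
        self.buckets = {}   # client key -> (tokens, last_seen)

    def allow(self, key):
        now = self.clock()
        # New clients start with a full bucket.
        tokens, last = self.buckets.get(key, (self.burst, now))
        # Refill for the elapsed time, capped at the burst size.
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        if tokens >= 1:
            self.buckets[key] = (tokens - 1, now)
            return True
        self.buckets[key] = (tokens, now)
        return False


# Usage: allow bursts of 10 requests, refilling 1 request per second.
limiter = TokenBucket(rate=1.0, burst=10)
if not limiter.allow("198.51.100.7"):
    print("429 Too Many Requests")
```

This sketch never evicts idle clients, so a real deployment would need to expire old entries or keep the state in a shared store.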


readyplayernull 593 days ago

Greedy crawlers will use fake user-agent strings.


Narhem 593 days ago

It’s relatively simple to detect crawlers. Writing a detector from scratch could take a few weeks if the infrastructure were already in place.

Given salaries, though, an externally managed solution might be cheaper.


andrethegiant 593 days ago

[Shameless plug] I'm building a platform[1] that abides by robots.txt, crawl-delay directive, 429s, Retry-After response header, etc out of the box. Polite crawling behavior as a default + centralized caching would decongest the network and be better for website owners.

[1] https://crawlspace.dev
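
For readers unfamiliar with the mechanisms listed above, here is a rough sketch of what honoring them can look like using only the Python standard library. It is not how the linked platform works; `load_rules`, `retry_after_seconds`, `next_delay`, and the `examplebot` agent name are hypothetical and chosen for illustration:

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from urllib.robotparser import RobotFileParser


def load_rules(robots_txt):
    """Parse robots.txt text that has already been fetched."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def retry_after_seconds(value, now=None):
    """Retry-After is either a number of seconds or an HTTP-date.
    Returns None if the value cannot be understood."""
    value = value.strip()
    if value.isdigit():
        return float(value)
    try:
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())


def next_delay(rules, agent, status, headers, default=1.0):
    """Seconds to wait before the next request to the same host."""
    # Crawl-delay from robots.txt, else a conservative default.
    delay = rules.crawl_delay(agent) or default
    if status == 429:
        hinted = retry_after_seconds(headers.get("Retry-After", ""))
        if hinted is not None:
            delay = max(delay, hinted)
    return delay


rules = load_rules("User-agent: *\nCrawl-delay: 5\nDisallow: /private/\n")
if rules.can_fetch("examplebot", "https://example.com/public"):
    wait = next_delay(rules, "examplebot", 429, {"Retry-After": "30"})
```

A crawler built this way checks `can_fetch` before every request and sleeps for `next_delay` seconds between requests to the same host.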
