Hacker News
by yazzku 593 days ago
tl;dr the crawlers do not respect robots.txt or the user agent anymore, but you can drop big bucks on the enterprise HA offering to stop them through other means.
3 comments

Should we webmasters just start blocking user agents wholesale?

I mean, except known good actors.

I guess known actors would need a verifiable signature.

Not viable. They are going to use user agents that look like those coming from completely normal human users.

"Verifiable signature"? That's a dangerous road to go down, and Google actually wanted to do it (the Web Environment Integrity API). Nobody supported them and they backed out.

Search engine crawlers do have verifiable signatures: if a client claims to be Googlebot or Bingbot, you don't have to take its word for it.

https://developers.google.com/search/docs/crawling-indexing/...

https://www.bing.com/webmasters/help/how-to-verify-bingbot-3...
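
For anyone who wants to see what that check looks like, here is a minimal sketch of the reverse-then-forward DNS lookup that both linked pages describe. The hostname suffixes are what I recall from those docs, so confirm them there before relying on this; the function and table names are my own, and the lookup only handles IPv4.

```python
import socket

# Hostname suffixes published for each crawler (assumption: taken from the
# linked Google and Bing docs as I recall them; verify before use).
TRUSTED_SUFFIXES = {
    "googlebot": (".googlebot.com", ".google.com"),
    "bingbot": (".search.msn.com",),
}


def _reverse(ip):
    # PTR lookup: IP address -> primary hostname.
    return socket.gethostbyaddr(ip)[0]


def _forward(host):
    # Forward lookup: hostname -> list of IPv4 addresses.
    return socket.gethostbyname_ex(host)[2]


def is_verified_crawler(ip, claimed, reverse=_reverse, forward=_forward):
    """Return True only if `ip` reverse-resolves to a trusted hostname
    that in turn forward-resolves back to the same `ip`."""
    suffixes = TRUSTED_SUFFIXES.get(claimed.lower())
    if suffixes is None:
        return False
    try:
        host = reverse(ip).rstrip(".").lower()
        if not host.endswith(suffixes):
            return False
        # The forward lookup stops someone who controls their own PTR record.
        return ip in forward(host)
    except OSError:
        # socket.herror and socket.gaierror are both OSError subclasses.
        return False
```

The resolvers are passed in as parameters so the logic can be exercised without network access. In production you would also want to cache results, since doing two DNS lookups per request is slow.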

But the converse is not true? There is no guarantee the crawler is not amassing data for model training, or that a crawler (AI or otherwise) does not disguise itself as a normal user?

Yeah, but traffic appearing to come from normal users can be throttled and/or CAPTCHA'ed while still allowing Google and Bing to crawl to their heart's content, so your SEO isn't affected.

I would think rate-limiting would be good. Crawlers are not patient enough to operate at the speed of a real human user.
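
A rough sketch of what per-client rate limiting could look like, using a token bucket (one common approach, not the only one). The class name, the key (an IP address, say), and the numbers are illustrative assumptions.

```python
import time


class TokenBucket:
    """Per-client token bucket: `rate` tokens refill per second, up to `burst`."""

    def __init__(self, rate, burst, clock=time.monotonic):
        self.rate = rate
        self.burst = burst
        self.clock = clock  # injectable so the limiter can be tested
        self.state = {}     # client key -> (tokens, time of last request)

    def allow(self, key):
        now = self.clock()
        # A client we have not seen before starts with a full bucket.
        tokens, last = self.state.get(key, (self.burst, now))
        # Refill for the time elapsed, capped at the burst size.
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        if tokens >= 1:
            self.state[key] = (tokens - 1, now)
            return True
        self.state[key] = (tokens, now)
        return False
```

With something like `TokenBucket(rate=1.0, burst=3)`, a human clicking around rarely runs out of tokens, while a client firing many requests per second gets refused after its first three. A real deployment would also need to evict idle keys so the table does not grow without bound.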

Greedy crawlers will use fake user-agent strings.

It's relatively simple to detect crawlers; writing a detector from scratch could take a few weeks if the infrastructure were in place.

With salaries factored in, though, an externally managed solution might be cheaper.

[Shameless plug] I'm building a platform[1] that abides by robots.txt, the crawl-delay directive, 429s, the Retry-After response header, etc. out of the box. Polite crawling behavior as a default + centralized caching would decongest the network and be better for website owners.

[1] https://crawlspace.dev
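
To make "polite crawling" concrete, here is a minimal sketch of the checks listed above using only Python's standard library. It is not taken from the platform linked above; the helper names and the one-second default delay are assumptions for illustration.

```python
import time
from email.utils import parsedate_to_datetime
from urllib.robotparser import RobotFileParser


def load_rules(robots_txt):
    # Parse robots.txt content that has already been downloaded.
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def plan_fetch(parser, agent, url, default_delay=1.0):
    """Return seconds to wait before fetching `url`, or None if disallowed."""
    if not parser.can_fetch(agent, url):
        return None
    delay = parser.crawl_delay(agent)  # None when no Crawl-delay applies
    return float(delay) if delay is not None else default_delay


def retry_after_seconds(value, now=None):
    """Interpret a Retry-After header: either delay-seconds or an HTTP-date."""
    value = value.strip()
    if value.isdigit():
        return float(value)
    try:
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None  # unparseable header; caller picks its own backoff
    now = time.time() if now is None else now
    return max(0.0, when.timestamp() - now)
```

A crawler loop would call `plan_fetch` before each request, sleep for the returned delay, and on a 429 response sleep for `retry_after_seconds(...)` (or a fallback backoff when it returns None) before retrying.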