| HN Mirror



	by jhpacker 594 days ago
	Cloudflare Radar, which is presumably a much bigger and better sample, reports Bytespider as the #5 AI crawler behind FB, Amazon, GPTBot, and Google: https://radar.cloudflare.com/explorer?dataSet=ai.bots And that's not including most of the highest-volume spiders overall, like Googlebot, Bingbot, Yandex, Ahrefs, etc. Not to say it isn't an issue, but that Fortune article they reference is pretty alarmist and thin on detail.


jsheard 594 days ago

The difference is that, AFAIK, those bigger AI crawlers do respect robots.txt. Google even provides a way to opt out of AI training without opting out of search indexing.
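
To make the opt-out concrete, here is a minimal sketch of a robots.txt that blocks the `Google-Extended` token (Google's control for AI training use) and `Bytespider` while leaving ordinary search crawling alone, checked with Python's standard `urllib.robotparser`. The rules and the `example.com` URL are illustrative assumptions, not anyone's real configuration, and the check only shows what a crawler that honors robots.txt would conclude; it does nothing to stop one that ignores the file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block AI-training tokens, allow everything else.
ROBOTS_TXT = """\
User-agent: Google-Extended
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: *
Allow: /
"""


def allowed(user_agent: str, url: str = "https://example.com/post/1") -> bool:
    """Return True if a compliant crawler with this user agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("Googlebot", "Google-Extended", "Bytespider"):
        print(agent, allowed(agent))
```

Under these rules Googlebot is still allowed, so search indexing is unaffected, while the two AI-related agents are told to stay out.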


yazzku 594 days ago

And how much do you trust that shit? Has anyone set up a honeypot as an experiment?
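
One way such a honeypot experiment might look, as a rough sketch: disallow a path for every user agent in robots.txt, never link to it anywhere a human would click, and then count which user agents request it anyway. The `/trap/` path, the log format (Apache/nginx "combined"), and the sample lines below are all assumptions made up for illustration, not real traffic. User-agent strings can be spoofed, so a hit is a hint rather than proof of who was crawling.

```python
import re
from collections import Counter

# Hypothetical trap path, assumed to be disallowed for all agents in robots.txt.
TRAP_PREFIX = "/trap/"

# Matches one line of the common "combined" access log format.
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'\d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"$'
)

# Made-up sample lines, for illustration only.
SAMPLE_LINES = [
    '203.0.113.5 - - [01/Jan/2025:00:00:01 +0000] "GET /trap/a HTTP/1.1" 200 512 "-" "ExampleBot/1.0"',
    '203.0.113.5 - - [01/Jan/2025:00:00:02 +0000] "GET /trap/b HTTP/1.1" 200 512 "-" "ExampleBot/1.0"',
    '198.51.100.7 - - [01/Jan/2025:00:00:03 +0000] "GET /index.html HTTP/1.1" 200 1024 "-" "PoliteBot/2.0"',
]


def trap_hits(lines):
    """Count requests to the trap path, grouped by user-agent string."""
    hits = Counter()
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match and match.group("path").startswith(TRAP_PREFIX):
            hits[match.group("agent")] += 1
    return hits


if __name__ == "__main__":
    for agent, count in trap_hits(SAMPLE_LINES).most_common():
        print(f"{count:4d}  {agent}")
```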


BXlnt2EachOther 594 days ago

possibly unpopular opinion, I trust the bigger companies more than small ones on stuff like this. It would be so much easier to not offer anything, rather than intentionally create a potemkin setting and risk the blowback that would occur if discovered. Hopefully this comment does not age poorly.

full disclosure: worked there [edit: google] a while ago, not in search, not in AI.


Arnt 594 days ago

You can trust Google to do what it says, and yes I've seen Google obey robots.txt. You can't trust Google to do what you think is right.


yazzku 593 days ago

No, you can't: https://apnews.com/article/828aefab64d4411bac257a07c1af0ecb


Arnt 592 days ago

I'm in a bit of a hurry and don't have time for a close reading. Does that article say some Google apps (notably Maps) store locations on your device even if you have configured them not to store them in your Google account? I may be missing something; I don't have time to read between the lines today.
