by jhpacker 594 days ago
Cloudflare Radar, which is presumably a much bigger and better sample, reports Bytespider as the #5 AI crawler behind FB, Amazon, GPTBot, and Google: https://radar.cloudflare.com/explorer?dataSet=ai.bots And that's not including most of the highest-volume spiders overall, like Googlebot, Bingbot, Yandex, Ahrefs, etc.

Not to say it isn't an issue, but that Fortune article they reference is pretty alarmist and thin on detail.


The difference is that, AFAIK, those bigger AI crawlers do respect robots.txt. Google even provides a way to opt out of AI training without opting out of search indexing.
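For anyone who wants to see what that opt-out looks like, here is a minimal sketch. It assumes the opt-out is expressed with the `Google-Extended` user-agent token in robots.txt, and it uses Python's standard `urllib.robotparser` to show how a compliant parser would read the rules. The `example.com` URL and the `allowed` helper are made up for illustration, and this only models the rules as written, not what any crawler actually does.

```python
from urllib.robotparser import RobotFileParser

# Assumed robots.txt: block the AI-training token, leave everything else open.
ROBOTS_TXT = """\
User-agent: Google-Extended
Disallow: /

User-agent: *
Allow: /
"""


def allowed(user_agent, url="https://example.com/post/1"):
    """Return True if the rules above permit user_agent to fetch url.

    The URL is a hypothetical placeholder; nothing is fetched over the network.
    """
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("Google-Extended", "Googlebot"):
        print(agent, "allowed" if allowed(agent) else "blocked")
```

Under these rules the search crawler falls through to the `*` group while the AI-training token is disallowed. Whether a given crawler honors that is exactly the trust question raised in this thread.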
And how much do you trust that shit? Has anyone set up a honeypot as an experiment?
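One rough way such a honeypot experiment could be run, as a sketch only: disallow a path in robots.txt, link it nowhere a human would click, and count which user agents request it anyway. The `/trap/` path, the `trap_hits` helper, and the assumption that the server writes combined-format access logs are all hypothetical choices here, not anything reported in this thread.

```python
import re
from collections import Counter

# Hypothetical honeypot path: disallowed in robots.txt and not linked visibly.
TRAP_PREFIX = "/trap/"

# Pulls the request path and user agent out of a combined-format log line
# (assumed format; adjust the pattern for other log layouts).
LOG_LINE = re.compile(
    r'"(?:GET|HEAD|POST) (?P<path>\S+) [^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def trap_hits(log_lines):
    """Count requests per user agent that touched the disallowed path."""
    hits = Counter()
    for line in log_lines:
        match = LOG_LINE.search(line)
        if match and match.group("path").startswith(TRAP_PREFIX):
            hits[match.group("agent")] += 1
    return hits


if __name__ == "__main__":
    # Made-up sample lines for illustration, not real traffic.
    sample = [
        '203.0.113.5 - - [01/Jan/2025:00:00:01 +0000] "GET /trap/page HTTP/1.1" 200 512 "-" "ExampleBot/1.0"',
        '203.0.113.9 - - [01/Jan/2025:00:00:02 +0000] "GET /index.html HTTP/1.1" 200 1024 "-" "Mozilla/5.0"',
    ]
    for agent, count in trap_hits(sample).items():
        print(agent, count)
```

User-agent strings can be spoofed, so a hit attributed to a big crawler would still need checking against that operator's published IP ranges before drawing conclusions.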
Possibly unpopular opinion: I trust the bigger companies more than small ones on stuff like this. It would be so much easier to not offer anything than to intentionally create a Potemkin setting and risk the blowback that would occur if discovered. Hopefully this comment does not age poorly.

full disclosure: worked there [edit: google] a while ago, not in search, not in AI.

You can trust Google to do what it says, and yes I've seen Google obey robots.txt. You can't trust Google to do what you think is right.
I'm in a bit of a hurry and don't have time for a close reading. Does that article say some Google apps (notably Maps) store locations on your device even if you have configured them not to store it in your Google account? I may be missing something; I don't have time to read between the lines today.