by dazc 765 days ago
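robots.txt isn't an ideal way of preventing pages from being indexed. It controls crawling, not indexing, so a blocked URL can still end up in the index if other pages link to it.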

X-Robots-Tag HTTP headers are more reliable: https://developers.google.com/search/docs/crawling-indexing/...
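
For anyone wondering what that looks like in practice, here's a rough sketch using Python's built-in wsgiref. The middleware name, the demo app, and the host/port are all made up for illustration; the only real piece is the `X-Robots-Tag: noindex` response header itself. Treat it as one possible way to do it, not the recommended one.

```python
from wsgiref.simple_server import make_server


def noindex_middleware(app):
    """Wrap a WSGI app so every response carries X-Robots-Tag: noindex."""
    def wrapped(environ, start_response):
        def patched_start_response(status, headers, exc_info=None):
            # Drop any X-Robots-Tag the inner app set, then add our own.
            headers = [(k, v) for k, v in headers if k.lower() != "x-robots-tag"]
            headers.append(("X-Robots-Tag", "noindex"))
            return start_response(status, headers, exc_info)
        return app(environ, patched_start_response)
    return wrapped


def hello_app(environ, start_response):
    # Hypothetical stand-in for your real application.
    start_response("200 OK", [("Content-Type", "text/plain; charset=utf-8")])
    return [b"private page\n"]


app = noindex_middleware(hello_app)

if __name__ == "__main__":
    # Assumed host and port, for local testing only.
    with make_server("127.0.0.1", 8000, app) as server:
        server.serve_forever()
```

One catch: the crawler has to be able to fetch the page to see the header, so the same URL shouldn't also be disallowed in robots.txt. In most setups you'd probably set the header in the web server or CDN config rather than in application code.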

Regarding AI, it's a bit trickier since AI crawlers aren't going to abide by your rules. Cloudflare have tools: https://blog.cloudflare.com/ai-bots/

How effective these are, though, I don't know.

1 comment

Oh, that's interesting. Yeah, paywalls might end up being the most effective way of keeping AI bots out, but even those could be circumvented. It's a tricky problem for sure.