| HN Mirror


by dazc 765 days ago

robots.txt isn't an ideal way of preventing pages from being indexed.

X-Robots-Tag HTTP headers are more reliable: https://developers.google.com/search/docs/crawling-indexing/...
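For anyone who wants to see what that looks like, here is a minimal sketch using only Python's standard library. The helper name `robots_header`, the handler class, and the port are made up for illustration; in practice you would more likely set the header in your web server or framework config. One caveat: the page must not also be blocked in robots.txt, otherwise the crawler never fetches it and never sees the header.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def robots_header(noindex=True, nofollow=False):
    """Build an X-Robots-Tag value from the requested directives (hypothetical helper)."""
    directives = []
    if noindex:
        directives.append("noindex")
    if nofollow:
        directives.append("nofollow")
    # "all" is the documented default, meaning no restrictions.
    return ", ".join(directives) or "all"


class NoIndexHandler(BaseHTTPRequestHandler):
    """Serves one page and asks compliant crawlers not to index it."""

    def do_GET(self):
        body = b"private page\n"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        # The header applies to this response only, so it also works for
        # non-HTML files such as PDFs, where a <meta> tag is not an option.
        self.send_header("X-Robots-Tag", robots_header(nofollow=True))
        self.end_headers()
        self.wfile.write(body)


if __name__ == "__main__":
    # Assumed address and port, for local testing only.
    HTTPServer(("127.0.0.1", 8000), NoIndexHandler).serve_forever()
```

Like robots.txt, this only works for crawlers that choose to honor it.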

AI crawlers are trickier, since they won't necessarily abide by your rules. Cloudflare has tools: https://blog.cloudflare.com/ai-bots/

How effective these are, though, I don't know.
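As a rough illustration of the simplest kind of server-side filtering, here is a sketch that matches the User-Agent header against a few publicly documented AI crawler tokens. This is not how Cloudflare's tools work internally (that isn't described here); the function name and the token list are assumptions for the example, and the list is nowhere near complete.

```python
# A small, assumed sample of self-declared AI crawler tokens; not exhaustive.
AI_BOT_TOKENS = ("gptbot", "ccbot", "claudebot")


def is_declared_ai_bot(user_agent):
    """Return True if the User-Agent string names one of the known tokens."""
    ua = (user_agent or "").lower()
    return any(token in ua for token in AI_BOT_TOKENS)


if __name__ == "__main__":
    print(is_declared_ai_bot("Mozilla/5.0 (compatible; GPTBot/1.0)"))
    print(is_declared_ai_bot("Mozilla/5.0 (X11; Linux x86_64) Firefox/120.0"))
```

This only catches bots that identify themselves honestly. A crawler that spoofs a browser User-Agent passes straight through, which is exactly the problem described above.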

1 comment

lucas_crocker 765 days ago

Oh, that's interesting. Yeah, paywalls might end up being the most effective way of preventing AI bots, but even those could be circumvented. It's a tricky problem for sure.
