by einpoklum 32 days ago
> reCaptcha is a pretty strong wall to allow only Google to index websites

Why would website authors _want_ to prevent crawling by other search engines?

1 comment

Because there's been a string of bad actors, including OpenAI, running incredibly inefficient scrapers.

Previously, captchas were just for limiting spam, but I actually looked at our system logs and about half of the traffic was badly behaved scrapers.

In the logs I see these scrapers hitting every link on the page. If you have a collection page, they hit every filter option, then each pagination button, the different sort orders, etc. For people running something like Forgejo, they will hit every commit.
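To give a rough sense of why following every link on a collection page gets expensive, here is a minimal sketch that counts the distinct URLs such a crawler would request. The filter names, sort orders, and page count are made up for illustration, and it assumes every filter combination has the same number of pages, so treat the result as an upper bound rather than a measurement.

```python
from itertools import product


def count_collection_urls(filter_options, sort_orders, pages):
    """Count URLs a follow-every-link crawler could request on one collection page."""
    # Each filter is either unset (None) or set to exactly one of its values.
    filter_choices = [[None, *values] for values in filter_options.values()]
    combos = sum(1 for _ in product(*filter_choices))
    # Every filter combination is crawled under every sort order and page.
    return combos * len(sort_orders) * pages


# Hypothetical collection page: all of these values are assumptions.
filters = {
    "color": ["red", "blue", "green"],
    "size": ["s", "m", "l", "xl"],
    "brand": ["a", "b"],
}
total = count_collection_urls(filters, ["price", "name", "newest"], pages=10)
print(total)  # (3+1) * (4+1) * (2+1) * 3 * 10 = 1800
```

So three small filters, three sort orders, and ten pages already turn one collection page into up to 1800 requests, each of which may be a separate database query.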

If you have expensive-to-compute pages, they're getting hit by these incredibly naive bots that don't respect robots.txt or discriminate about what they request.
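For contrast, this is roughly what respecting robots.txt looks like on the crawler side, sketched with Python's standard `urllib.robotparser`. The robots.txt contents, the `ExampleBot` user agent, and the example.com URLs are all hypothetical; a real crawler would fetch the file from the site instead of using an inline string.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that keeps crawlers off the expensive collection pages.
ROBOTS_TXT = """\
User-agent: *
Disallow: /collections/
Crawl-delay: 5
"""


def make_parser(robots_txt):
    """Build a parser from robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = make_parser(ROBOTS_TXT)

# "ExampleBot" is a made-up user agent, matched here by the "*" rule.
allowed_home = parser.can_fetch("ExampleBot", "https://example.com/")
allowed_filter = parser.can_fetch(
    "ExampleBot", "https://example.com/collections/shoes?sort=price&page=3"
)
delay = parser.crawl_delay("ExampleBot")

print(allowed_home, allowed_filter, delay)
```

A polite bot would skip any URL where `can_fetch` returns False and wait `delay` seconds between requests. The scrapers described above appear to do neither.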