by stubish 22 days ago
If you use Cloudflare to proxy your site, there is a button to click that blocks the AI crawlers (even on the free tier). It is almost as if the AI crawlers are a DDoS attack. You can't really do it any other way, since many of them don't respect robots.txt. At least until someone comes up with crowdsourced blacklists with few false positives.
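
To make the robots.txt point concrete, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt contents and the `polite_crawler_may_fetch` helper are made up for illustration. The check runs on the crawler's side, so it only does anything if the crawler chooses to call it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for illustration: ask one AI crawler to stay out.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def polite_crawler_may_fetch(user_agent: str, url: str) -> bool:
    """What a well-behaved crawler would check before requesting a URL."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    print(polite_crawler_may_fetch("GPTBot", "https://example.com/page"))
    print(polite_crawler_may_fetch("Mozilla/5.0", "https://example.com/page"))
```

Nothing here runs on the server. A crawler that skips the check, or sends a different user agent, gets the page anyway.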

"You can't really do it any other way"

Any custom solution by a half-competent programmer filters out all web crawlers. I've been running a semi-public website for years and nothing gets past.

Yeah, I feel like unless you run a site large enough for Google monkeys to write a special case for your site specifically, why not just password-protect the entire site but put the password on the login page? Or any other rudimentary captcha, I suppose, like in the old days.

It doesn't keep out anyone even mildly interested in your site specifically, including scrapers, but at least it blocks Googlebot etc.
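
A rough sketch of that kind of gate, written as a plain function so it stays independent of any web framework. The password, the `gate` cookie name, and the `handle` signature are all invented for illustration. The cookie is trivially forgeable, which fits the point that this only stops generic crawlers.

```python
# Hypothetical password, deliberately printed on the login page itself.
SITE_PASSWORD = "open-sesame"

LOGIN_PAGE = (
    '<form method="post" action="/login">'
    f"<p>The password is: {SITE_PASSWORD}</p>"
    '<input name="password"><button>Enter</button>'
    "</form>"
)


def handle(method: str, path: str, cookies: dict, form: dict):
    """Return (status, headers, body) for a single request."""
    if method == "POST" and path == "/login":
        if form.get("password", "") == SITE_PASSWORD:
            # Mark the visitor as having passed the gate, then send them home.
            return 303, {"Location": "/", "Set-Cookie": "gate=ok; Path=/"}, ""
        return 403, {}, LOGIN_PAGE
    if cookies.get("gate") == "ok":
        return 200, {}, "<h1>Actual site content</h1>"
    # Anything that has not submitted the form only ever sees the login page.
    return 401, {}, LOGIN_PAGE
```

A crawler that only issues GET requests and never submits the form sees nothing but the login page. A scraper written for this one site can post the password or set the cookie in a couple of lines.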

Funny edge case when you can’t read the password because you need it for access.
You have heuristics, blacklists and captchas. Anything else to add? Those three can all turn away legitimate traffic from public sites. Spambots have been pretending to be legitimate users for decades, and they tend to be pretty dumb. Cloudflare and other large hosts get to do heuristics pretty well, as they can aggregate data from millions of sites rather than the few an individual might run. And even they block and force captchas on legitimate users, per the complaints you hear here regularly.
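
For a sense of why a single-site heuristic misfires, here is a minimal sketch of a per-IP sliding-window rate check. The class name and thresholds are made up, and this is not how Cloudflare or anyone else actually does it. The limiter only sees an IP address, so several real people behind one shared address look the same to it as one aggressive bot.

```python
from collections import defaultdict, deque


class RateHeuristic:
    """Flag a client IP that makes too many requests in a short window."""

    def __init__(self, max_requests: int = 5, window_seconds: float = 10.0):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # client IP -> recent request times

    def allow(self, client_ip: str, now: float) -> bool:
        hits = self._hits[client_ip]
        # Forget requests that have aged out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False  # looks like a bot, or like an office behind one NAT
        hits.append(now)
        return True
```

With the assumed limit of 5 requests per 10 seconds, six people on the same office network loading a page at once would get one of them blocked.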