by efilife 31 days ago
"You can't really do it any other way"

Any custom solution by a half-competent programmer filters out all web crawlers. I've been running a semi-public website for years and nothing gets past it.

2 comments

Yeah, I feel like unless you run a site large enough for Google monkeys to write a special case for it, why not just password-protect the entire site but put the password on the login page? Or any other rudimentary captcha, I suppose, like in the old days.

It doesn't keep out anyone even mildly interested in your site specifically, including scrapers, but at least it blocks Googlebot etc.
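
For anyone curious what the "password on the login page" gate could look like, here is a minimal sketch using only the Python standard library. Everything in it is an assumption for illustration: the password `letmein`, the cookie name `gate`, and the page contents are all made up, and nobody in this thread said this is how they do it. The idea is just that a human reads the password and submits the form, while a generic crawler that never posts forms only ever sees the login page.

```python
from http.cookies import SimpleCookie
from urllib.parse import parse_qs

PASSWORD = "letmein"  # hypothetical; deliberately printed on the login page
COOKIE_NAME = "gate"  # hypothetical cookie name

LOGIN_PAGE = (
    "<p>The password is: {pw}</p>"
    '<form method="post"><input name="password"><button>Enter</button></form>'
).format(pw=PASSWORD)


def has_gate_cookie(environ):
    """Return True if the request carries the cookie set after a correct login."""
    cookie = SimpleCookie(environ.get("HTTP_COOKIE", ""))
    return COOKIE_NAME in cookie and cookie[COOKIE_NAME].value == PASSWORD


def app(environ, start_response):
    # Visitors who already passed the gate get the real content.
    if has_gate_cookie(environ):
        start_response("200 OK", [("Content-Type", "text/html; charset=utf-8")])
        return [b"<p>Actual site content.</p>"]

    # A correct form submission sets the cookie and redirects back to the site.
    if environ.get("REQUEST_METHOD") == "POST":
        size = int(environ.get("CONTENT_LENGTH") or 0)
        form = parse_qs(environ["wsgi.input"].read(size).decode("utf-8"))
        if form.get("password", [""])[0] == PASSWORD:
            start_response("303 See Other", [
                ("Location", "/"),
                ("Set-Cookie", f"{COOKIE_NAME}={PASSWORD}; Path=/; HttpOnly"),
            ])
            return [b""]

    # Everyone else, including crawlers that never submit forms, sees the login page.
    start_response("200 OK", [("Content-Type", "text/html; charset=utf-8")])
    return [LOGIN_PAGE.encode("utf-8")]


if __name__ == "__main__":
    from wsgiref.simple_server import make_server
    make_server("", 8000, app).serve_forever()
```

As said above, this stops nothing that targets the site specifically: a scraper only has to post one form field and keep the cookie.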

Funny edge case when you can’t read the password because you need it for access.

You have heuristics, blacklists and captchas. Anything else to add? Those three can all turn away legitimate traffic from public sites. Spambots have been pretending to be legitimate users for decades, and they tend to be pretty dumb. Cloudflare and other large hosts get to do heuristics pretty well, as they can aggregate data from millions of sites rather than the few an individual might run. And even they block legitimate users and force captchas on them, per the complaints you regularly hear here.