by data-ottawa 31 days ago
reCaptcha is a pretty strong wall to allow only Google to index websites, especially now that you need device verification. Throw in Cloudflare too.

There’s not much room to squeeze in when your competitors hold the keys to 15 million top websites.

2 comments

I write a lot of scrapers. Both of those are pretty trivial to bypass at scale.
What about not at scale?

I find it wild that "at scale" we can bypass anti-bot measures, but just "normal" internet use (e.g. a non-Google browser or a VPN) will throw a million captchas at you.

CGNAT is pretty bad too.

Not at scale, what you’re seeing is a tiny fraction of the potential captchas that can be thrown at you. Normally “we have seen this cookie before” or “this browser does not have webdriver fingerprints” is sufficient to not get a captcha.

The big issue you sidestep when you're not at scale is IP reputation: you can come from a single residential IP with a good reputation.

Mandatory captchas for simply viewing a page are rare - most are saved for high-impact actions like account creation.

When this does happen for a simple page view, AI is extremely good at solving basic captchas - especially basic “click the box” captchas.

If you don’t want to pay for AI, there are decaptcha services where someone in Southeast Asia solves the captcha for fractions of a penny. Save the cookies after a successful solve and you’re probably good in the future.

If you don’t want to pay for someone to solve a simple check-the-box captcha, a little bit of attention and some properly simulated clicking (i.e. not a JavaScript-injected event) will often work. Just don’t click literally the exact middle; fuzz the coordinates and you’re good.

> reCaptcha is a pretty strong wall to allow only Google to index websites

Why would website authors _want_ to prevent crawling by other search engines?

Because there's been a string of bad actors including OpenAI with incredibly inefficient scrapers.

Previously captcha was just for spam limiting, but I actually looked at our system logs and about half of the traffic was badly behaved scrapers.

In the logs I see these scrapers hitting every link on the page. If you have a collection page, they hit every filter option, then each pagination button, the different sort orders, etc. For people running something like Forgejo, they will hit every commit.
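To make that pattern concrete, here is a rough sketch of how you might spot it in your own logs. It assumes you have already parsed the access log into (client, URL) pairs; the function name `flag_exhaustive_crawlers` and the threshold of 20 are made up for illustration, not taken from any real tool.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def flag_exhaustive_crawlers(requests, max_variants=20):
    """Flag clients that request many query-string variants of one path.

    `requests` is an iterable of (client, url) pairs, e.g. pulled from an
    access log (the parsing step is assumed to have happened already).
    Returns {client: {path: number_of_distinct_query_strings}} for every
    (client, path) pair that exceeds `max_variants`.
    """
    variants = defaultdict(set)  # (client, path) -> distinct query strings
    for client, url in requests:
        parts = urlsplit(url)
        variants[(client, parts.path)].add(parts.query)

    flagged = {}
    for (client, path), queries in variants.items():
        if len(queries) > max_variants:
            flagged.setdefault(client, {})[path] = len(queries)
    return flagged
```

A client walking every filter, sort order and page of a collection ends up with dozens of distinct query strings on a single path, while a normal visitor has a handful. The threshold is a guess and would need tuning per site.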

If you have expensive-to-compute pages, they're getting hit by these incredibly naive bots that don't respect robots.txt or discriminate in what they fetch.
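For contrast, respecting robots.txt takes very little code on the crawler side. A minimal sketch using Python's standard `urllib.robotparser`; the robots.txt content, the `example-bot` user agent and the helper names are made-up examples, not anything from a real site or crawler.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt a site might serve to keep crawlers off
# expensive pages (search results, per-commit views).
EXAMPLE_ROBOTS = """\
User-agent: *
Disallow: /search
Disallow: /commits/
Crawl-delay: 10
"""


def build_policy(robots_txt):
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_fetch(parser, user_agent, url):
    """True if the parsed rules allow `user_agent` to request `url`."""
    return parser.can_fetch(user_agent, url)


policy = build_policy(EXAMPLE_ROBOTS)
allowed = may_fetch(policy, "example-bot", "https://example.com/about")
blocked = may_fetch(policy, "example-bot", "https://example.com/search?q=x")
delay = policy.crawl_delay("example-bot")  # seconds to wait between requests
```

A polite crawler would call `may_fetch` before every request and sleep for `delay` seconds between requests when a crawl delay is set.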