by hipadev23 638 days ago
Companies have been trying and failing to prevent large-scale crawling for 25 years. It’s a constant arms race and the scrapers always win.

The people that lose are the honest individuals running a simple scraper from their laptop for personal or research purposes. Or, as you pointed out, any new AI startup that can’t get the same low cost of data acquisition the others benefited from.

2 comments

> The people that lose ...

are also everyone who makes (literally) any effort in the direction of digital privacy, whose internet experience is degraded and frustrating due to increasingly bad captchas or just outright refusal of service.

The people that lose are the ones left with bandwidth charges and overloaded servers.

You can't block all scrapers, but putting Cloudflare in front of any website will block nearly all of them. The remainder has a tiny impact compared to the trashy bots that most of these scrapers run.

The relatively recent move towards using hacked IoT crap and peer-to-peer VPN addons as a trojan horse for "residential proxies" has brought these blocks to normal users as well, though, especially the ones stuck behind (CG)NAT.

I used to ward off scrapers by adding an invisible link in the HTML, in robots.txt (under a Disallow rule, of course), and in the sitemap; any request for it would block the requester's entire /24 on my firewall. I removed that at some point because I had a PHP script run a sudo command and that was probably Not Good. It still worked pretty well, though I'd probably expand the block range to /20 these days (and /40 for IPv6).
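
For anyone wanting to try the trap-link idea, here is a rough Python sketch of the range calculation. It is not the original PHP setup: the trap path, the function names, and the in-memory blocklist are all made up for illustration, and the sketch only records the range to block instead of touching the firewall, so nothing web-facing needs sudo. Applying the recorded ranges would be left to a separate privileged job.

```python
import ipaddress

# Hypothetical trap URL: linked invisibly in the HTML, listed under a
# Disallow rule in robots.txt, and included in the sitemap.
TRAP_PATH = "/do-not-follow"


def block_range(client_ip: str, v4_prefix: int = 24, v6_prefix: int = 40) -> str:
    """Return the CIDR range containing client_ip that should be blocked."""
    addr = ipaddress.ip_address(client_ip)
    prefix = v4_prefix if addr.version == 4 else v6_prefix
    # strict=False masks off the host bits instead of raising ValueError.
    network = ipaddress.ip_network(f"{addr}/{prefix}", strict=False)
    return str(network)


def record_if_trapped(path: str, client_ip: str, blocklist: set) -> bool:
    """Add the requester's range to blocklist if it fetched the trap URL."""
    if path != TRAP_PATH:
        return False
    blocklist.add(block_range(client_ip))
    return True


def is_blocked(client_ip: str, blocklist: set) -> bool:
    """Check whether client_ip falls inside any recorded range."""
    addr = ipaddress.ip_address(client_ip)
    for cidr in blocklist:
        network = ipaddress.ip_network(cidr)
        # Only compare addresses and networks of the same IP version.
        if network.version == addr.version and addr in network:
            return True
    return False
```

Passing `v4_prefix=20` gives the wider IPv4 range mentioned above. Keep in mind the earlier point about (CG)NAT: the wider the range, the more ordinary users sharing it get blocked too.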