Hacker News
by aorth 360 days ago
Why is it useless and harmful? Many of us are struggling—without massive budgets or engineering teams—to keep services up due to incredible load from scrapers in recent years. We do use rate limiting, but scrapers circumvent it with residential proxies and brute force. I often see concurrent requests from hundreds or thousands of IPs in one data center. Who do these people think they are?
2 comments

Residential proxy users are paying on the order of $5 per gigabyte, so send them really big files once detected. Or "click here to load the page properly" followed by a trickle of garbage data.
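A rough sketch of the "trickle of garbage data" idea, in Python. The function names, chunk size, and delay are made up for illustration, and the $5 per gigabyte figure is just the estimate from the comment above, not a measured price. The generator could be plugged into whatever streaming response your server framework offers.

```python
import os
import time
from typing import Callable, Iterator


def trickle(total_bytes: int,
            chunk_size: int = 64,
            delay_s: float = 1.0,
            sleep: Callable[[float], None] = time.sleep) -> Iterator[bytes]:
    """Yield random bytes in small chunks, pausing between chunks.

    Hypothetical helper: chunk_size and delay_s are arbitrary defaults.
    """
    sent = 0
    while sent < total_bytes:
        n = min(chunk_size, total_bytes - sent)
        yield os.urandom(n)
        sent += n
        if sent < total_bytes:
            sleep(delay_s)  # keep the scraper's connection busy


def proxy_cost_usd(num_bytes: float, usd_per_gb: float = 5.0) -> float:
    """Estimate what the scraper pays for num_bytes at the assumed rate."""
    return num_bytes / 1e9 * usd_per_gb
```

Under that assumed rate, getting a detected scraper to pull 2 GB would cost it roughly $10. Note the two tactics pull in different directions: a slow trickle wastes the scraper's time and connections, while a big file wastes its bandwidth budget (and yours).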
There is no real way to confidently tell if someone is using a residential proxy.
Once you spot a specific pattern you can detect that pattern.
It is harmful because innocent users routinely get caught in your dragnet. And why even have a public website if the goal is not to serve it?

What is the actual problem with serving users? You mentioned incredible load. I would stop using inefficient PHP or JavaScript or Ruby for web servers. I would use Go or Rust or a comparable efficient server with native concurrency. Survival always requires adaptation.

How do you know that the alleged proxies belong to the same scrapers? I would look carefully at the IP chain in the X-Forwarded-For (XFF) header and rate-limit the subnets that appear in it.
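Something like the following Python sketch is one way to rate-limit by subnet from the XFF chain. The class and function names, the /24 and /48 prefix lengths, and the sliding-window policy are all assumptions for illustration. It also assumes the header is set by a proxy you control, since a client can otherwise put anything it likes in XFF.

```python
import ipaddress
import time
from collections import defaultdict, deque


def subnet_of(ip: str) -> str:
    """Map an address to its /24 (IPv4) or /48 (IPv6) network (assumed sizes)."""
    addr = ipaddress.ip_address(ip.strip())
    prefix = 24 if addr.version == 4 else 48
    return str(ipaddress.ip_network(f"{addr}/{prefix}", strict=False))


def client_subnets(xff: str) -> list[str]:
    """Return the distinct subnets named in an X-Forwarded-For value."""
    subnets = []
    for part in xff.split(","):
        try:
            subnets.append(subnet_of(part))
        except ValueError:
            continue  # skip malformed entries
    return list(dict.fromkeys(subnets))  # dedupe, keep order


class SubnetLimiter:
    """Sliding-window limiter keyed by subnet (hypothetical helper)."""

    def __init__(self, max_requests: int, window_s: float, clock=time.monotonic):
        self.max_requests = max_requests
        self.window_s = window_s
        self.clock = clock
        self.hits = defaultdict(deque)

    def allow(self, xff: str) -> bool:
        now = self.clock()
        subnets = client_subnets(xff)
        for s in subnets:
            q = self.hits[s]
            while q and now - q[0] >= self.window_s:
                q.popleft()  # forget hits older than the window
        if any(len(self.hits[s]) >= self.max_requests for s in subnets):
            return False
        for s in subnets:
            self.hits[s].append(now)
        return True
```

For example, `SubnetLimiter(max_requests=100, window_s=60).allow(request_xff)` would reject a request once any subnet in its chain has made 100 requests in the last minute. A request with no parseable XFF entries is allowed here, so you would still want a per-connection fallback.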

Another way is to require authentication for expensive endpoints.
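A minimal sketch of gating an expensive endpoint behind a token, in Python. The token store, the handler signature, and the `expensive_search` endpoint are hypothetical stand-ins. A real deployment would use whatever auth mechanism the site already has.

```python
import hmac
from typing import Callable, Mapping

# Hypothetical token store; in practice this would live in a database.
VALID_TOKENS = {"example-token"}


def require_token(handler: Callable) -> Callable:
    """Reject requests that lack a valid 'Authorization: Bearer <token>' header."""
    def wrapped(headers: Mapping[str, str], *args, **kwargs):
        scheme, _, token = headers.get("Authorization", "").partition(" ")
        ok = scheme == "Bearer" and any(
            # constant-time comparison on bytes
            hmac.compare_digest(token.encode(), t.encode())
            for t in VALID_TOKENS
        )
        if not ok:
            return 401, "authentication required"
        return handler(headers, *args, **kwargs)
    return wrapped


@require_token
def expensive_search(headers: Mapping[str, str], query: str):
    """Stand-in for an endpoint that is costly to serve."""
    return 200, f"results for {query}"
```

Anonymous requests then get a cheap 401 instead of triggering the costly work, while cheap pages can stay open to everyone.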