Hacker News
by hibikir 693 days ago
The idea is not to make scraping impossible, but to make it expensive. A human doesn't make requests as fast as a bot, so a bot pretending to be human is still rate limited. Eventually you need an account, accounts get tracked too, those matching specific patterns get purged, and so on. This will not stop scraping, but the point is not to stop it; the point is to make it expensive and slow. Eventually, expensive enough that it is better to stop pretending to be a human and pay for a license, and then the arms race goes away.

Can defenses be good enough that it's better not to even try to fight? That's a far harder question than wondering whether a random bot can make a dozen requests while pretending to be human.
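
To make the "pretend human is still rate limited" point concrete, here is a minimal token-bucket sketch. It is an illustration only: the `TokenBucket` class, its parameters, and the numbers are my own assumptions, not anything a particular site is known to use.

```python
import time


class TokenBucket:
    """Hypothetical per-client limiter: allows short bursts, then a steady rate."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate          # tokens refilled per second
        self.capacity = capacity  # maximum burst size
        self.clock = clock        # injectable so the limiter can be tested
        self.tokens = capacity
        self.last = clock()

    def allow(self):
        """Return True if the request may proceed, False if it is throttled."""
        now = self.clock()
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

A person clicking around rarely drains the bucket. A scraper does so almost immediately, after which it moves at exactly the refill rate no matter how human its headers look, and that is the cost being imposed.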

2 comments

I liked the analogy to Gabe Newell's "piracy is a service problem" adage, embodied in Virgin API consumer vs Chad third-party scraper https://x.com/gf_256/status/1514131084702797827

Make it easier to get the data, put fewer roadblocks in the way of legitimate access, and you'll find fewer scrapers. Even if you make scraping _very_ hard, people will still prefer it if legitimate use is even more cumbersome, or if you refuse to offer a legitimate option at all.

Admittedly, we are talking here because some people are scraping OSM when they could get the entire dataset for free... but I'm hoping these people are outliers, and most consume the non-profit org's data in the way they ask.

I think this very example proves that the adage is wrong, or at least that it doesn't capture the full picture.
Well, it isn't a case of piracy, is it? The data exists on the website, for free, under the assumption/social contract that you are a human, not an agent of a shady enterprise wasting the bandwidth. An analogy would be the game itself being put out for free on itch.io, but then downloaded and unpacked to make an asset flip.
Ironic of him to say that, seeing that it's often easier to pirate Steam games than to get them legitimately.
The only way I can see to make it truly expensive to scrape is to build JavaScript bitcoin mining into every request.
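
A per-request compute cost doesn't have to be actual bitcoin mining. A rough sketch of the same idea, swapping in a hashcash-style proof of work: the server hands out a challenge, the client must find a nonce whose SHA-256 hash has enough leading zero bits, and the server checks it with a single hash. The function names and the difficulty value are made up for illustration, and a real deployment would run the solver in the browser rather than in Python.

```python
import hashlib
import itertools


def leading_zero_bits(digest: bytes) -> int:
    """Count the zero bits at the start of a digest."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()
        break
    return bits


def verify(challenge: bytes, nonce: int, difficulty: int) -> bool:
    """Server side: one hash, cheap to check."""
    digest = hashlib.sha256(challenge + str(nonce).encode()).digest()
    return leading_zero_bits(digest) >= difficulty


def solve(challenge: bytes, difficulty: int) -> int:
    """Client side: brute force, about 2**difficulty hashes on average."""
    for nonce in itertools.count():
        if verify(challenge, nonce, difficulty):
            return nonce


# Hypothetical usage: difficulty 12 means roughly 4096 hashes per request.
nonce = solve(b"example-challenge", 12)
assert verify(b"example-challenge", nonce, 12)
```

The asymmetry is the point: one page view costs a human a barely noticeable delay, while a scraper pays that cost millions of times over.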