by stevefeinstein 3246 days ago
So again someone wants to punish all the legitimate people using a web site to get some marginal benefit from detecting the remaining <1%. The inevitable false positives don't affect the "malicious" users. Only the legitimate ones. And how much will this bloat the page load by? Adding more code to an already overly large page isn't helping anyone.

Just let the web be the web, and stop trying to control it.


Mentioned this in another comment, but for some websites, the scraping problem has real costs. Airline, hotel, stock prices, etc. For some spaces, scaling and paying for bandwidth to serve unconstrained scraping is costly. And not restricting it hurts the legitimate users because the performance sucks.

There are also the scrapers blindly looking for vulnerabilities or other unsavory tactics.

While I agree with you to a degree, most airlines and hotels have APIs that can be consumed to get pricing information; there are just restrictions on what you can do with that information.

Not sure about stock prices (I think it's pretty common to pay for real time data there?).

But I can certainly see sites that have a lot of data for their users facing major bandwidth costs if a lot of people were scraping their data. This type of detection isn't really an answer for that, though, as it's easy for a scraper to work around.

Also, what I've learned is how little regard for your site your scrapers often have, scraping as aggressively as possible.

You're just not always in a place to scale to the abuse or build something more complex than some simple heuristic filters.
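For what it's worth, a "simple heuristic filter" of the kind mentioned above can be as small as a per-client sliding window. This is only a rough sketch, not anything a particular site actually runs: the class name `SlidingWindowFilter`, the thresholds, and the idea of keying on a client id (IP, session, API key) are all assumptions for illustration.

```python
from collections import defaultdict, deque


class SlidingWindowFilter:
    """Flag clients that make more than max_requests within window_seconds.

    Hypothetical helper for illustration; the thresholds are made up.
    """

    def __init__(self, max_requests=30, window_seconds=10.0):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        # client id -> timestamps of that client's recent requests
        self._hits = defaultdict(deque)

    def is_suspicious(self, client_id, now):
        """Record a request at time `now` (seconds) and report if over the limit."""
        hits = self._hits[client_id]
        hits.append(now)
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] > self.window_seconds:
            hits.popleft()
        return len(hits) > self.max_requests
```

In use you would call `is_suspicious(client_ip, time.monotonic())` on each request and throttle or challenge when it returns True. It will catch the blunt, aggressive scrapers and nothing smarter than that.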

> what I've learned is how little regard for your site your scrapers often have, scraping as aggressively as possible.

Often? Based on what data?

I find it much more likely that you only notice the aggressive scrapers. That, however, tells you nothing about the behavior of the average web scraper or web scrapers in general.

The system encourages it. Ingress data is cheap, and so many scrapers just default to high frequency.

You're not taking into consideration the economic consequences some bots have on companies. Some bots are designed to make payments with fake or stolen credit cards. Some bots create work for the people who need to manually check submissions or takedown notices. Obviously I agree that there are legitimate uses for bots and scrapers, but that admittedly low percentage of fraudulent use cases does cause a lot of harm.

Some bots can and will use a real browser, in a real window, opening many sessions, randomising usage patterns to look more human, and continue doing what they already do.

If a human is allowed to do something on a site, it stands to reason that a bot should be allowed to as well (provided it keeps to the same access frequency as a human).

Blocking scraping is like DRM. Don't do it. Use a legal mechanism to deal with copyright infringement, and use an acceptable use policy to deal with heavy users that are using more than their "fair" share of bandwidth.
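If you wanted to enforce a "fair share" at roughly human access frequency without caring whether the client is a bot, a token bucket per user is one common way to do it. A minimal sketch follows, assuming one bucket per account or IP; `TokenBucket` and its parameters are hypothetical names, and the rate and capacity are whatever your policy says.

```python
class TokenBucket:
    """Allow short bursts up to `capacity`, then `rate_per_second` on average.

    Hypothetical helper for illustration; keep one instance per user.
    """

    def __init__(self, rate_per_second, capacity):
        self.rate = rate_per_second
        self.capacity = capacity
        self.tokens = float(capacity)  # start with a full bucket
        self.last = None               # time of the previous call, if any

    def allow(self, now):
        """Return True if a request at time `now` (seconds) is within quota."""
        if self.last is not None:
            # Refill in proportion to elapsed time, capped at capacity.
            elapsed = now - self.last
            self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

A request that gets False would normally be answered with HTTP 429 rather than silently dropped, so well-behaved scrapers can back off. Humans and bots get the same limit, which is the point.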

Some people just hire actual humans to do this, for cheap.