by creatonez 637 days ago
This seems like a gimmick. Isn't preventing crawling a sisyphean task? The only real difference this will make is further entrenching big players who have already crawled a ton of data. And if this feature comes at the cost of false positives and overbearing captchas, it will start to affect users.
4 comments

Companies have been trying and failing to prevent large scale crawling for 25 years. It’s a constant arms race and the scrapers always win.

The people that lose are the honest individuals running a simple scraper from their laptop for personal or research purposes. Or, as you pointed out, any new AI startup that can’t compete with the same low cost of data acquisition the others benefited from.

> The people that lose ...

are also everyone who makes (literally) any effort in the direction of digital privacy, whose internet experience is degraded and frustrating due to increasingly bad captchas or just outright refusal of service.

The people that lose are the ones left with bandwidth charges and overloaded servers.

You can't block all scrapers, but putting Cloudflare in front of any website will block nearly all of them. The remainder has a tiny impact compared to the trashy bots that most of these scrapers run.

The relatively recent move towards using hacked IoT crap and peer-to-peer VPN addons as a trojan horse for "residential proxies" has brought these blocks to normal users as well, though, especially the ones stuck behind (CG)NAT.

I used to ward off scrapers by adding an invisible link in the HTML, the robots.txt (under a Disallow rule, of course), and the sitemap; any request to that link would block the entire /24 of the requester on my firewall. Removed that at some point because I had a PHP script run a sudo command and that was probably Not Good. Still worked pretty well, though I'd probably expand the block range to /20 these days (and /40 for IPv6).
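For anyone curious what the range math looks like, here is a rough sketch in Python rather than the PHP-plus-sudo setup described above. The trap path and the function names are made up for illustration, and it only computes the range to block; feeding that into a firewall is left out on purpose.

```python
import ipaddress

# Hypothetical trap path: linked invisibly in the HTML and listed under a
# Disallow rule in robots.txt, so well-behaved crawlers never request it.
TRAP_PATH = "/trap-do-not-follow"


def block_range(ip: str, v4_prefix: int = 24, v6_prefix: int = 40):
    """Return the network that contains `ip` at the chosen prefix length."""
    addr = ipaddress.ip_address(ip)
    prefix = v4_prefix if addr.version == 4 else v6_prefix
    # strict=False masks off the host bits instead of raising ValueError.
    return ipaddress.ip_network(f"{addr}/{prefix}", strict=False)


def range_to_block(path: str, client_ip: str):
    """Return the range to block if the request hit the trap, else None."""
    if path == TRAP_PATH:
        return block_range(client_ip)
    return None


if __name__ == "__main__":
    print(range_to_block(TRAP_PATH, "203.0.113.77"))
    print(block_range("203.0.113.77", v4_prefix=20))
```

Passing v4_prefix=20 widens the IPv4 block as suggested; the wider the range, the more likely it catches innocent users behind the same (CG)NAT.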

The risk of getting sued prevents companies from using pirated software.

The big players might just pay the fee because they might one day need to prove where they got the data from.

My website contains millions of pages. It's not hard to notice the difference between a bot (or network) that wants to access all pages and a regular user.
Oh, you will not notice. The pages can easily be spread out between residential IPs using headless browsers (masked as real ones); unless you really pay attention, you won't see the ones that want to hide.
Every single argument against Cloudflare's features highlights exactly why people use Cloudflare so much.

You're talking about people setting up a botnet in order to scrape every scrap of data they can off of every website they touch. Why on earth would anyone be okay with such parasitic behaviors?

That's the thing, CF ain't gonna protect you against that. You need to consider actual access controls to actually restrict access.

Otherwise you're blaming people for using the data you've published. So what if they do?

How many scrapers are sophisticated enough to go this far though? Most of them are probably of bad quality and can be detected.
Why would those sophisticated enough to go that far be of low quality?
Unless they are scraping it using residential botnet proxies, unique user-agents, unique device types, etc...
How often are the bots indexing it?
If you listen to the people complaining about bots at the moment, some bots are scraping the same pages over and over to the tune of terabytes per day because the bot operators have unlimited money and their targets don't.
> because the bot operators have unlimited money

I rather think the cause is that inbound bandwidth is usually free, so they need maybe 1/100th of the money because requests are smaller than responses (plus discounts they get for being big customers)

> I rather think the cause is that inbound bandwidth is usually free, so they need maybe 1/100th of the money because requests are smaller than responses (plus discounts they get for being big customers)

Seems like there's the potential to take advantage of this for a semi-custom protocol, if there's a desire to balance costs for serving data while still making things available to end users. We'd have the server reply to the initial request with a new HTTP response instructing the client to re-request with a POST containing an N-byte (N = data size) one-time pad. The client can receive this, generate random data (or all zeros, up to the client); and the server then will send the actual response XOR'd with the one-time pad.
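A minimal sketch of the XOR step, just to make the idea concrete. There is no real HTTP here; the function names are invented for illustration, and the "server" and "client" are plain function calls standing in for the two round trips.

```python
import secrets


def make_pad(length: int) -> bytes:
    """Client side: generate a pad as long as the announced response size."""
    return secrets.token_bytes(length)


def xor_bytes(data: bytes, pad: bytes) -> bytes:
    """XOR two equal-length byte strings; applying it twice restores data."""
    if len(data) != len(pad):
        raise ValueError("pad must be exactly as long as the data")
    return bytes(d ^ p for d, p in zip(data, pad))


if __name__ == "__main__":
    body = b"<html>the actual page</html>"

    # Round trip 1: the server announces N = len(body) instead of the body.
    announced_size = len(body)

    # Round trip 2: the client uploads an N-byte pad (the costly part).
    pad = make_pad(announced_size)

    # The server answers with the body XOR'd with the pad.
    wire = xor_bytes(body, pad)

    # The client undoes the XOR with the pad it kept.
    assert xor_bytes(wire, pad) == body
```

An all-zero pad leaves the body unchanged on the wire, which matches the "up to the client" remark: the point is the upload volume, not secrecy.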

Upside: Most end users don't pay for upload; if bot operators do, this incurs a dollar cost only to them. Downside: Increased download cost for the web site operator (but we've postulated that this is small compared to upload cost), extra round trip, extra time for each request (especially for end users with asymmetric bandwidth).

Eh, just a thought.

May work for small pages, like most of my webpages besides some downloadable files, but megabytes of JavaScript on an average (mobile?) connection are going to take very significantly longer to load, cost more battery, and take twice as much from your data bundle

Perhaps it's effective as bot deterrent when someone incurs, say, a ten times higher than median load (as measured in something like CPU time per hour or bandwidth per week or so). It will not prevent anyone from seeing your pages so information is still free, but it levels the playing field -- at least, for those with free inbound bandwidth dealing with bots that pay for outgoing bandwidth
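A rough sketch of that threshold check, assuming you already have a per-client load figure (bytes served per week, CPU seconds per hour, whatever you measure). The function name and the sample numbers are made up for illustration.

```python
from statistics import median


def clients_over_threshold(usage: dict[str, float], factor: float = 10.0) -> set[str]:
    """Return the clients whose load exceeds `factor` times the median load."""
    if not usage:
        return set()
    threshold = factor * median(usage.values())
    return {client for client, load in usage.items() if load > threshold}


if __name__ == "__main__":
    # Hypothetical weekly bandwidth per client, in megabytes.
    weekly_mb = {"alice": 1.0, "bob": 2.0, "carol": 3.0, "crawler": 100.0}
    print(clients_over_threshold(weekly_mb))
```

Only the clients returned here would be asked to do the extra upload; everyone else gets the normal response.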

> because the bot operators have unlimited money and their targets don't.

wget/curl vs django/rails, who wins?

> The only real difference this will make is further entrenching big players

It's the opposite. Only big players like Google get meetings with big publishers and copyright holders to be individually whitelisted in robots.txt, whereas a marketplace is accessible to any startup or university.