by boristsr 641 days ago
I'm pretty interested in how companies are exploring ways to properly monetize or compensate for scraped content to help keep a strong ecosystem of quality content. I'd love to see more efforts like this.
4 comments

There's an HTTP status code for charging for access: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/402

Then there's a Lightning Network protocol for it: https://docs.lightning.engineering/the-lightning-network/l40...

With the Cloudflare stuff, it just seems like an excuse to sell Cloudflare services (and continue to force everyone to use it) as opposed to just figuring out a standard way of using what is already built to provide access for some type of micropayment.

The problem is that soft technical measures like HTTP 402 and robots.txt aren't legally binding, so there's nothing stopping scrapers from just ignoring them. Cloudflare's value proposition here is that they will play the cat-and-mouse game of detecting things like spoofed user agents and residential proxies on your behalf, and actively block what appears to be scraper traffic unless the scraper pays up.

Unfortunately this probably means even more CAPTCHAs for people using VPNs and other privacy measures as they ramp up the bot detection heuristics.

Sure it's not legally binding, but if I see >100000 requests coming from 1 IP address within a week, I'm also not legally bound to make that 402 error go away. By having an automated payment mechanism, the two parties could come to an agreement they're both happy about.
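
To make that concrete, here is a rough sketch of the "too many requests from one IP, so answer 402 until they pay" idea. Everything in it is hypothetical: the `PaymentGate` class, `mark_paid`, and `status_for` are names made up for illustration, the threshold and window are just the numbers from the comment above, and the actual payment step is left out entirely. It also keeps one timestamp per request in memory, which a real server would probably replace with a cheaper counter.

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 7 * 24 * 3600  # one week, as in the comment above
MAX_REQUESTS = 100_000          # threshold taken from the comment above


class PaymentGate:
    """Hypothetical helper: picks 200 or 402 per request based on recent volume."""

    def __init__(self, max_requests=MAX_REQUESTS, window=WINDOW_SECONDS):
        self.max_requests = max_requests
        self.window = window
        self.hits = defaultdict(deque)  # ip -> timestamps of recent requests
        self.paid = set()               # ips that settled up (assumed out of band)

    def mark_paid(self, ip):
        """Record that this client paid; how payment happens is not modeled."""
        self.paid.add(ip)

    def status_for(self, ip, now):
        """Return the HTTP status to send for a request from `ip` at time `now`."""
        recent = self.hits[ip]
        # Drop timestamps that have fallen out of the sliding window.
        while recent and now - recent[0] >= self.window:
            recent.popleft()
        recent.append(now)
        if ip in self.paid or len(recent) <= self.max_requests:
            return 200
        return 402  # Payment Required
```

With a tiny threshold such as `PaymentGate(max_requests=3, window=10)`, the fourth request inside the window gets a 402, and the same client gets 200 again after `mark_paid` or once the old requests age out.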

> there's nothing stopping scrapers from just ignoring them

Feel free to ignore HTTP errors, but those pages don't contain the content you're looking for

(For the record, I don't use HTTP 402, but I noncommercially host stuff and know what bots people are complaining about.)

I mean it's not legally binding in the sense that if you start sending 402s or 403s to a scraper it can just take that as a signal to try again from a different IP address until it works - your server's clearly stated intent that the bot should pay up or go away isn't legally actionable. With enough effort you can chase the bots until they run out of resources, but few people have time to win that battle by themselves, hence delegating it to Cloudflare or similar.

> Unfortunately this probably means even more CAPTCHAs for people using VPNs and other privacy measures as they ramp up the bot detection heuristics

Yeah. You can't have it both ways. Similar dilemma for requiring identification vs disallowing immigrants.

Companies have been trying to find novel ways to bypass fair use / public domain laws for a long time.

Each time they do, we see more consolidation of the media, and lower pay for the people that produce the content.

I don’t see why this particular effort will turn out differently.

I wonder if there's a way to test this hypothesis. Does content being freely reproducible with minor modification increase the demand for content creators, since new content is more valuable than existing content that can be copied?

I'd guess that since AI can fair-useify a work faster than any human, content creators who rely on fair use (reviewers, compilers/collagers, re-imaginers, etc.) will be devalued.

However, AIs are as yet unable to create work as innovative as humans. Therefore new work should be more valuable since now there is demand from people and AIs for their work. I'm assuming that AI companies pay for the work that they use in some way. Hopefully the aggregation sites continue to compete for content creators.

> "I'm assuming that AI companies pay for the work that they use in some way."

That mistaken assumption is at the heart of the problem under discussion.

> help keep a strong ecosystem of quality content

To the extent quality content does exist online: what isn't either already behind a paywall, or created by someone other than who will be compensated under such a scheme?

This won't work. If you are doing an AI startup, you will want to use GoogleBot for your crawler and this will bypass that.

Not too much of a loss, since the only quality content is already behind paywalls, or on diverse wikistyle sites. Anything served with ads for commercial reasons is automatically drivel, based on my experience. There simply isn't a business in making it better.

Edit: updated comment to not be needlessly divisive.

It is trivial to detect fake GoogleBot traffic (Google provides ways to validate it) and Cloudflare already does so. See for yourself:

  curl -I -H "User-Agent: Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Googlebot/2.1; +http://www.google.com/bot.html) Chrome/105.0.5195.102 Safari/537.36" https://www.cloudflare.com

They'll immediately flag the request as malicious and return 403 Forbidden, even if your IP address is otherwise reputable.

Now try it from a Google Cloud VM.

Pretty sure that won't work. They let you validate whether an IP address is used by GoogleBot specifically, not just owned by Google in general. I doubt they are foolish enough to use the same pool of IP addresses for their internal crawlers and their public cloud.

https://developers.google.com/search/docs/crawling-indexing/...
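
For reference, the DNS-based check is roughly: reverse-resolve the IP, confirm the hostname is under one of Google's crawler domains, then forward-resolve that hostname and confirm it maps back to the same IP. Below is a minimal sketch of that, not Google's or Cloudflare's actual code. The function name `is_verified_googlebot` and the injectable `reverse`/`forward` parameters are made up for illustration, and the suffix list is an assumption to be checked against Google's current documentation. It deliberately does not accept googleusercontent.com, on the assumption that Cloud VMs can have reverse DNS under that domain.

```python
import socket

# Assumed hostname suffixes for Google's own crawlers; verify against Google's docs.
CRAWLER_SUFFIXES = (".googlebot.com", ".google.com")


def is_verified_googlebot(ip, reverse=None, forward=None):
    """Forward-confirmed reverse DNS check for a claimed Googlebot IP (IPv4 sketch).

    `reverse` and `forward` are hypothetical hooks so the lookups can be
    replaced in tests; by default they use the standard socket module.
    """
    reverse = reverse or (lambda addr: socket.gethostbyaddr(addr)[0])
    forward = forward or (lambda host: socket.gethostbyname_ex(host)[2])
    try:
        host = reverse(ip)  # PTR lookup
    except OSError:
        return False
    # The PTR record alone is not proof: whoever controls the IP's reverse zone sets it.
    if not host.lower().rstrip(".").endswith(CRAWLER_SUFFIXES):
        return False
    try:
        # Forward lookup must map the hostname back to the same IP.
        return ip in forward(host)
    except OSError:
        return False
```

The forward lookup is the step that matters: a check that stops at "the hostname looks like Google" or "the IP is in a Google AS" would also pass traffic from infrastructure that anyone can rent.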

It depends on how the site has implemented it; a huge number just look for AS origination and *googleusercontent.com