

by jasonthorsness 347 days ago

I turned this on and it adjusts the robots.txt automatically; not sure what else it is doing.

   # NOTICE: The collection of content and other data on this
   # site through automated means, including any device, tool,
   # or process designed to data mine or scrape content, is
   # prohibited except (1) for the purpose of search engine indexing or
   # artificial intelligence retrieval augmented generation or (2) with express
   # written permission from this site’s operator.

   # To request permission to license our intellectual
   # property and/or other materials, please contact this
   # site’s operator directly.

   # BEGIN Cloudflare Managed content

   User-agent: Amazonbot
   Disallow: /

   User-agent: Applebot-Extended
   Disallow: /

   User-agent: Bytespider
   Disallow: /

   User-agent: CCBot
   Disallow: /

   User-agent: ClaudeBot
   Disallow: /

   User-agent: Google-Extended
   Disallow: /

   User-agent: GPTBot
   Disallow: /

   User-agent: meta-externalagent
   Disallow: /

   # END Cloudflare Managed Content

   User-agent: *
   Disallow: /*
   Allow: /$
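For anyone who wants to check what a compliant crawler would conclude from the managed block above, here is a minimal sketch using Python's standard `urllib.robotparser`. The `blocked_agents` helper and the `example.com` URL are made up for illustration. The trailing `User-agent: *` group is left out on purpose: as far as I know the standard-library parser does plain prefix matching and does not interpret `*` or `$` in paths, so it would not be a fair test of those rules.

```python
from urllib.robotparser import RobotFileParser

# The Cloudflare-managed groups as quoted in the post. The final
# "User-agent: *" group is omitted deliberately (see the note above).
ROBOTS_TXT = """\
User-agent: Amazonbot
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: GPTBot
Disallow: /

User-agent: meta-externalagent
Disallow: /
"""


def blocked_agents(robots_txt, agents, url="https://example.com/some/page"):
    """Return the agents from `agents` that may NOT fetch `url`.

    `url` is a placeholder; any path works because every group disallows "/".
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [agent for agent in agents if not parser.can_fetch(agent, url)]


if __name__ == "__main__":
    print(blocked_agents(ROBOTS_TXT, ["GPTBot", "ClaudeBot", "Googlebot"]))
```

This only tells you what a well-behaved crawler should do. A scraper that ignores robots.txt is unaffected.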

8 comments

1vuio0pswjnm7 347 days ago

"User-agent: CCBot disallow: /"

Is Common Crawl exclusively for "AI"?

CCBot was already in so many robots.txt files prior to this.

How is CC supposed to know or control how people use the archive contents?

What if CC is relying on fair use?

   # To request permission to license our intellectual
   # property and/or other materials, please contact this
   # site's operator directly

If the operator has no intellectual property rights in the material, then do they need permission from the rights holders to license such materials for use in creating LLMs and collect licensing fees?

Is it common for website terms and conditions to permit site operators to sublicense other people's ("users'") work for use in creating LLMs for a fee?

Is this fee shared with the rights holders?

ronsor 347 days ago

   # To request permission to license our intellectual
   # property and/or other materials, please contact this
   # site's operator directly

Scrapers don't accept the terms of service.

Ironically, I've only ever scraped sites that block CCBot, otherwise I'd rather go to Common Crawl for the data.

nemomarx 347 days ago

Read a ToS and you'll notice that you give the site operators an unlimited license to reproduce or spread your works, on almost any site. It's essentially required in order to host and show the content.

postalcoder 347 days ago

This is interesting. The reasoning and response don't line up.

  > Cloudflare is making the change to protect original content on the internet, Mr. Prince said. If A.I. companies freely use data from various websites without permission or payment, people will be discouraged from creating new digital content, he said

  >  prohibited except for the purpose of [..] artificial intelligence retrieval augmented generation

This seems to be targeted at taxing training of language models, but why an exclusion for the RAG stuff? That seems like it has a much greater immediate impact for online content creators, for whom the bots are obviating a click.

fennecfoxy 347 days ago

With that opinion, are you also suggesting that we ban ad blockers? Because it's better I not click & consume resources than click and not be served ads, basically just costing the host money.

It makes sense to allow for RAG in the same way that search engines provide a snippet of an important chunk of the page.

A blog author could not complain that their blog is getting ragged when they're extremely liable to be Google/whatever searching all day and basically consuming others' content in exactly the same way that they're trying to disparage.

ijk 347 days ago

What I want to know is if the flood of scraping everyone has been complaining about is coming from people trying to scrape for training or bots doing RAG search.

I get that everyone wants data, but presumably the big players already scraped the web. Do they really need to do it again? Or is it bit players reproducing data that's likely already in the training set? Or is it really that valuable to have your own scraped copy of internet scale data?

I feel like I'm missing something here. My expectation is that RAG traffic is going to be orders of magnitude higher than scraping for training. Not that it would be easy to measure from the outside.

mattcollins 347 days ago

I wondered about this, too.

Cloudflare have some recent data about traffic from bots (https://blog.cloudflare.com/from-googlebot-to-gptbot-whos-cr...) which indicates that, for the time being, the overwhelming majority of the bot requests are for AI training and not for RAG.

progmetaldev 347 days ago

I believe it's both. We're at a place where legislation hasn't really declared what is and isn't allowed. These scrapers are acting like Googlebot or any other search engine crawler, and trying to find any kind of new content that might be of value to their users.

New data is still being added online daily (probably hourly, if not more often) by humans, and the first ones to gain access could be the "winners," particularly if their users happen to need up to date data (and the service happens to have scraped it). Just like with search engines/crawlers, there's also the big players that may respect your website, but there are also those that don't use rate-limiting or respect robots.txt.

wiether 347 days ago

You should ask Zuck, since, from what we've seen and what we were asked to act against, Meta is the main culprit in scraping every single page of websites, multiple times a day.

And I'm talking about ecommerce websites, with their bot scraping every variation of each product, multiple times a day.

postalcoder 347 days ago

I don't think we should ban ad blockers, but I also think it's fair to suggest that the loss of organic traffic could be affecting the incentive to create new digital content, at least as much as the fear of having your content absorbed into an LLM's training data.

Boldened15 347 days ago

IMO the backlash against LLMs is more philosophical, a lot of people don’t like them or the idea of one learning from their content. Unless your website has some unique niche information unavailable anywhere else there’s no direct personal risk. RAG would be a more direct threat if anything.

toomuchtodo 347 days ago

It's really about who is getting the value from the work of the content. If content creators of all sorts have their work consumed by LLMs, and LLM orgs charge for it and capture all the value, why should people create to have their work vacuumed up for the robot's benefit? For exposure? You can't eat or pay rent with exposure. Humans must get paid, and LLMs (foundational models and output using RAG) cannot improve without a stream of works and data humans create.

Whether you call it training or something else is irrelevant, it's really exploitation of human work and effort for AI shareholder returns and tech worker comp (if those who create aren't compensated). And the technocracy has not been, based on the evidence, great stewards of the power they obtain through this. Pay the humans for their work.

o11c 347 days ago

It's not philosophical, it's economical.

AI scrapers increase traffic by maybe 10x (this varies per site) but provide no real value whatsoever to anyone. If you look at various forms of "value":

* Saying "this uses AI" might make numbers go up on the stock market if you manage to persuade people it will make numbers go up (see also: the market will remain irrational longer than you can remain solvent).

* Saying "this uses AI" might fulfill some corporate mandate.

* Asking AI to solve a problem (for which you would actually use the solution) allows you to "launder" the copyright of whatever source it is paraphrasing (it's well established that LLMs fail entirely if a question isn't found within their training set). Pirating it directly provides the same value, with significantly fewer errors and less handholding.

* Asking AI to entertain you ... well, there's the novelty factor I guess, but even if people refuse to train themselves out of that obsession, the world is still far too full of options for any individual to explore them all. Even just the question of "what kind of ways can I throw some kind of ball around" has more answers than probably anyone here knows.

What am I missing?

robrenaud 347 days ago

Why are 100s of millions of people using AI if it is providing no value?

lxgr 347 days ago

More and more people use ChatGPT for search, so blocking that doesn't seem like a successful strategy long-term.

bee_rider 347 days ago

I wonder… Google scrapes for indexing and for AI, right? I wonder if they will eventually say: ok, you can have me or not, if you don’t want to help train my AI you won’t get my searches either. That’s a tough deal but it is sort of self-consistent.

mrweasel 347 days ago

Very few people seem to be complaining that Google crashes their sites. Google also publishes its crawlers' IP ranges, but you really don't need to rate-limit Google; they know how to back off and not overload sites.

Symbiote 347 days ago

In theory — in practise I've had to limit Google on two large sites at work. I currently have them limited to 10/s for non-cached requests.
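For reference, a per-crawler cap like the 10/s mentioned above is commonly done with a token bucket. The following is only a rough sketch of that idea, not the setup actually used on those sites; the `TokenBucket` class and its parameters are hypothetical, and in practice this usually lives in the web server or CDN config rather than in application code.

```python
import time


class TokenBucket:
    """Allow up to `rate` requests per second, with bursts up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate            # tokens added per second
        self.capacity = capacity    # maximum tokens held at once
        self.clock = clock          # injectable so the bucket can be tested
        self.tokens = float(capacity)
        self.last = clock()

    def allow(self):
        """Return True if the request may proceed, False if it should be rejected."""
        now = self.clock()
        # Refill according to elapsed time, never exceeding capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


if __name__ == "__main__":
    # One bucket for the crawler's non-cached requests: 10 per second.
    bucket = TokenBucket(rate=10, capacity=10)
    allowed = sum(bucket.allow() for _ in range(25))
    print(f"{allowed} of 25 back-to-back requests allowed")
```

Rejected requests would typically get a 429 or 503 so that a well-behaved crawler backs off.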

progmetaldev 347 days ago

Curious if the content on those sites might have high value to Google? Such as if they have data that is new or unavailable elsewhere, or if they're just standard sites, and you've just been unlucky?

I have had odd bot behavior from some major crawlers, but never from Google. I wonder if there is a correlation to usefulness of content, or if certain sites get stuck in a software bug (or some other strange behavior).

Symbiote 346 days ago

Google do value the sites, they have data unavailable elsewhere. At some point we had an automated message saying the site had too many pages and would no longer be indexed, then a human message saying that was a mistake, and our site was an exception to that rule.

But as with any contact with these large companies, our contact eventually disappeared.

giancarlostoro 347 days ago

"Embrace, Extend, Extinguish" Google's mantra. And yes, I know about Microsoft's history with that phrase ;) But Google has done this with email, browsers (Google has web apps that run fine on Firefox but request you use Chrome), Linux (Android), and I'm sure there's others I am forgetting about.

So yeah, I too could see them doing this.

xyst 347 days ago

So this is in addition to updating the robots.txt file, which really only blocks a small number of them.

Seems CF has been gathering data and profiling these malicious agents.

This post by CF elaborates a bit further: https://blog.cloudflare.com/declaring-your-aindependence-blo...

Basically becomes a game of cat and mouse.

Bender 347 days ago

For my silly hobby sites I just return status 444 (close the connection) for anything that has a case-insensitive "bot" in the UA and requests anything other than robots.txt, humans.txt, favicon.ico, etc... This would also drop search engines but I blackhole route most of their CIDR blocks. I'm probably the only one here that would do this.
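The rule described above can be sketched as a small predicate. This is an illustration only, not the commenter's actual config: the function name `should_drop` is made up, the allow-list contains just the three files named in the comment (the "etc." is not filled in), and 444 is presumably nginx's non-standard status that closes the connection without sending a response.

```python
# Paths that stay reachable even for "bot" user agents.
# Only the three files named in the comment are listed here.
ALLOWED_PATHS = {"/robots.txt", "/humans.txt", "/favicon.ico"}


def should_drop(user_agent: str, path: str) -> bool:
    """Return True if the request should be dropped (e.g. answered with 444)."""
    if path in ALLOWED_PATHS:
        return False
    # Case-insensitive substring match on "bot" anywhere in the User-Agent.
    return "bot" in user_agent.lower()


if __name__ == "__main__":
    print(should_drop("GPTBot/1.0", "/index.html"))
    print(should_drop("GPTBot/1.0", "/robots.txt"))
    print(should_drop("Mozilla/5.0 (X11; Linux x86_64)", "/"))
```

As the comment says, this also catches search engine crawlers such as Googlebot, and it does nothing against scrapers that send a browser-like User-Agent.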

sneak 347 days ago

How does a bot scraping your silly hobby sites for any purpose harm or negatively affect you in any way?

pixl97 347 days ago

Depends if they hit a site enough to make it cost something. It's not hard for bots to flood servers.

Bender 346 days ago

Only if they push me over my bandwidth limits but they can't do that if I just drop them on the floor.

lxgr 347 days ago

That's at least a more reasonable default than what I've seen at least one newspaper do, which is to explicitly block both LLM scrapers and things like ChatGPT's search feature.

slenk 347 days ago

I thought I saw Cloudflare insert noindex links?

swyx 347 days ago

what actually are the consequences of ignoring robots.txt (apart from DDOS)? have any of these cases ended up in court at all?

v5v3 346 days ago

The BBC recently served a cease and desist on Perplexity to stop, and to delete all existing scraped content.

https://www.bbc.co.uk/news/articles/cy7ndgylzzmo

So an AI company can just be naughty till asked to stop, and then exclude that one company that has the financial resources to go legal.