I turned this on and it adjusts the robots.txt automatically; not sure what else it is doing.

# NOTICE: The collection of content and other data on this
# site through automated means, including any device, tool,
# or process designed to data mine or scrape content, is
# prohibited except (1) for the purpose of search engine indexing or
# artificial intelligence retrieval augmented generation or (2) with express
# written permission from this site’s operator.

# To request permission to license our intellectual
# property and/or other materials, please contact this
# site’s operator directly.

# BEGIN Cloudflare Managed content

User-agent: Amazonbot
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: GPTBot
Disallow: /

User-agent: meta-externalagent
Disallow: /

# END Cloudflare Managed Content

User-agent: *
Disallow: /*
Allow: /$
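For anyone wondering what rules like these actually do to a given crawler, here is a rough sketch that evaluates them by hand. It assumes the longest-match semantics described in RFC 9309 (the longest matching pattern wins, and Allow wins a tie) with `*` and `$` wildcards. The names `parse_groups`, `pattern_to_regex`, and `is_allowed` are made up for illustration, the rule set is abbreviated, and user agents are matched by exact token, which is a simplification. Real crawlers may interpret the file differently, or ignore it.

```python
import re

# Abbreviated copy of the rules quoted above (only three named agents kept).
ROBOTS_TXT = """\
# BEGIN Cloudflare Managed content
User-agent: CCBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: GPTBot
Disallow: /
# END Cloudflare Managed Content

User-agent: *
Disallow: /*
Allow: /$
"""


def parse_groups(text):
    """Return {lowercased user-agent token: [(directive, pattern), ...]}."""
    groups = {}
    agents = []       # user agents of the group currently being read
    in_rules = False  # True once the current group has seen a rule line
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:  # a user-agent line after rules starts a new group
                agents, in_rules = [], False
            agents.append(value.lower())
            groups.setdefault(value.lower(), [])
        elif field in ("allow", "disallow"):
            in_rules = True
            if value:  # an empty pattern matches nothing, so skip it
                for agent in agents:
                    groups[agent].append((field, value))
    return groups


def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern ('*' wildcard, trailing '$') to a regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    return re.compile(body + (r"\Z" if anchored else ""))


def is_allowed(groups, user_agent, path):
    """Longest matching pattern wins; Allow wins a tie; no match means allowed."""
    # Simplification: exact token lookup, falling back to the '*' group.
    rules = groups.get(user_agent.lower(), groups.get("*", []))
    best = ("allow", "")
    for directive, pattern in rules:
        if pattern_to_regex(pattern).match(path):
            longer = len(pattern) > len(best[1])
            tie_allow = len(pattern) == len(best[1]) and directive == "allow"
            if longer or tie_allow:
                best = (directive, pattern)
    return best[0] == "allow"


if __name__ == "__main__":
    groups = parse_groups(ROBOTS_TXT)
    for agent in ("GPTBot", "CCBot", "SomeOtherBot"):  # last name is hypothetical
        for path in ("/", "/about"):
            print(agent, path, is_allowed(groups, agent, path))
```

Under those assumptions, the named bots are refused everywhere, and every other crawler is allowed only the homepage (`/`), because `Allow: /$` ties with `Disallow: /*` on that one path and loses everywhere else.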
Is Common Crawl exclusively for "AI"?
CCBot was already in so many robots.txt files prior to this.
How is CC supposed to know or control how people use the archive contents?
What if CC is relying on fair use?
If the operator has no intellectual property rights in the material, then do they need permission from the rights holders to license such materials for use in creating LLMs and collect licensing fees?
Is it common for website terms and conditions to permit site operators to sublicense other people's ("users") work for use in creating LLMs for a fee?
Is this fee shared with the rights holders?