by 1vuio0pswjnm7
351 days ago
"User-agent: CCBot disallow: /"

Is Common Crawl exclusively for "AI"?

CCBot was already in so many robots.txt files prior to this.

How is CC supposed to know or control how people use the archive contents?

What if CC is relying on fair use?

# To request permission to license our intellectual
# property and/or other materials, please contact this
# site's operator directly

If the operator has no intellectual property rights in the material, then do they need permission from the rights holders to license such materials for use in creating LLMs and collect licensing fees?

Is it common for website terms and conditions to permit site operators to sublicense other people's ("users") work for use in creating LLMs for a fee?

Is this fee shared with the rights holders?
|
Ironically, I've only ever scraped sites that block CCBot; otherwise I'd rather go to Common Crawl for the data.
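
For anyone who wants to check whether a given robots.txt actually blocks CCBot before deciding where to get the data, here is a minimal sketch using Python's standard `urllib.robotparser`. The helper name `blocks_ccbot`, the sample rules, and the `example.com` URL are made up for illustration; it parses robots.txt text you have already fetched rather than making a network request.

```python
from urllib.robotparser import RobotFileParser


def blocks_ccbot(robots_txt: str, url: str = "https://example.com/") -> bool:
    """Return True if the given robots.txt text disallows CCBot for `url`.

    `blocks_ccbot` is a hypothetical helper, not part of any library.
    """
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch("CCBot", url)


# Sample rules in the same spirit as the directive quoted above.
SAMPLE = """\
User-agent: CCBot
Disallow: /
"""

if __name__ == "__main__":
    print(blocks_ccbot(SAMPLE))
    # A file that only names some other crawler leaves CCBot unaffected.
    print(blocks_ccbot("User-agent: GPTBot\nDisallow: /\n"))
```

Note that this only reports what the file asks for; whether a crawler honors it, and what happens to the data afterwards, is a separate question.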