| HN Mirror

by 1vuio0pswjnm7 351 days ago

"User-agent: CCBot disallow: /"
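For reference, the effect of a rule like the one quoted above can be checked with Python's standard `urllib.robotparser`. This is only a minimal sketch, assuming the rule is written on two lines as it would be in an actual robots.txt file; the `is_allowed` helper, the `example.com` URL, and the `SomeOtherBot` name are hypothetical placeholders, not anything from the site in question.

```python
from urllib.robotparser import RobotFileParser

# The quoted rule, written on two lines as in an actual robots.txt file.
ROBOTS_TXT = """\
User-agent: CCBot
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical URL and second crawler name, used only for illustration.
ccbot_allowed = is_allowed(ROBOTS_TXT, "CCBot", "https://example.com/page")
other_allowed = is_allowed(ROBOTS_TXT, "SomeOtherBot", "https://example.com/page")
```

Note that the rule only names the crawler. It says nothing about what the crawled data is later used for, and a crawler that ignores robots.txt is not stopped by it.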

Is Common Crawl exclusively for "AI"

CCBot was already in so many robots.txt prior to this

How is CC supposed to know or control how people use the archive contents

What if CC is relying on fair use

   # To request permission to license our intellectual
   # property and/or other materials, please contact this
   # site's operator directly

If the operator has no intellectual property rights in the material, then do they need permission from the rights holders to license such materials for use in creating LLMs and collect licensing fees

Is it common for website terms and conditions to permit site operators to sublicense other people's ("users") work for use in creating LLMs for a fee

Is this fee shared with the rights holders

2 comments

ronsor 351 days ago

   # To request permission to license our intellectual
   # property and/or other materials, please contact this
   # site's operator directly

Scrapers don't accept the terms of service.

Ironically, I've only ever scraped sites that block CCBot; otherwise I'd rather go to Common Crawl for the data.

nemomarx 351 days ago

Read a ToS and you'll notice that, on almost any site, you give the site operators an unlimited license to reproduce or distribute your work. It's essentially required in order to host and show the content.
