by Spivak
616 days ago
We have billions of people; we can accomplish two, maybe three, things at a time. This is as valid a use as any of that archived data. The part that sucks isn't that people are doing unusual things with it, like training AI, but that copyright & capitalism mean everyone has to go fetch their own copy of the data, to the annoyance of web admins. The biggest technical hurdle to sharing the work among interested parties is that the web only authenticates the pipe, not the content.
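To make "authenticate the content, not the pipe" a bit more concrete, here's a rough sketch of what content-level checking could look like, assuming some trusted party publishes a digest for each archived page. The function names and the `sha256:` prefix are made up for illustration; this only covers integrity against a known digest, not proving which origin server produced the bytes.

```python
import hashlib
import hmac


def content_digest(body: bytes) -> str:
    """Return a SHA-256 digest of the raw body (hypothetical 'sha256:' label)."""
    return "sha256:" + hashlib.sha256(body).hexdigest()


def matches_published_digest(body: bytes, published: str) -> bool:
    """Check a copy fetched from anywhere against a digest published elsewhere."""
    # compare_digest avoids leaking how many leading characters matched.
    return hmac.compare_digest(content_digest(body), published)


if __name__ == "__main__":
    original = b"<html><body>archived page</body></html>"
    published = content_digest(original)  # what a trusted party would publish

    # The same bytes verify no matter which mirror served them.
    print(matches_published_digest(original, published))
    # A tampered copy does not.
    print(matches_published_digest(original + b"<!-- edited -->", published))
```

With something like this, one crawl could be passed around between interested parties and each could check the bytes themselves instead of re-fetching from the origin.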
"Our goal is to democratize the data so that everyone, not just big companies, can do high-quality research and analysis."
Because they share it openly, including with those doing AI, they wind up on "AI crawler" lists. Those lists are increasingly used by blocking tools that just "use the AI list", whether by people who don't like AI or, quite ironically, by people who are trying to prevent the excess traffic that poorly mannered AI crawlers cause. (Common Crawl's crawler is well mannered: it uses a good user-agent, respects robots.txt including crawl-delay, etc.)
https://commoncrawl.org/
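For anyone wondering what "respects robots.txt including crawl-delay" means in practice, here's a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt content and the `fetch_policy` helper are made up for illustration; `CCBot` is the user-agent token Common Crawl's crawler identifies as. This is not Common Crawl's actual code, just what a polite crawler would check before each request.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt a site admin might serve (illustration only).
ROBOTS_TXT = """\
User-agent: CCBot
Crawl-delay: 5
Disallow: /private/

User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def fetch_policy(user_agent: str, url: str):
    """Return (allowed, crawl_delay_seconds_or_None) for this agent and URL."""
    return parser.can_fetch(user_agent, url), parser.crawl_delay(user_agent)


if __name__ == "__main__":
    for url in ("https://example.com/blog/post", "https://example.com/private/x"):
        allowed, delay = fetch_policy("CCBot", url)
        print(url, "allowed" if allowed else "blocked", "delay:", delay)
```

A well-mannered crawler skips the blocked URL and sleeps for the delay between requests to the same host, so a site that wants less traffic can ask for that in robots.txt rather than blocking the crawler outright.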