|
|
|
|
|
by nickpsecurity
565 days ago
|
|
They appear to use Common Crawl in the DCLM dataset. Just downloading Common Crawl is probably copyright infringement before we consider specific terms in the licenses. Arxiv papers have a mix of licenses with some not allowing commercial use. If I got the sources right, it’s already illegal with just two sources they scraped. That’s why I want one on Gutenberg content that has no restrictions. |
|